In classical rate control, all GoPs are treated equally, and frame-level bit allocation is basedon the frame type and a complexity measure, sometimes with multiple passes over the video, but without considering the semantics of picture content. In the H.264/AVC reference encoder, the GoP borders are determined according to a predefined pattern of frames, and the same target bit rate is used for each GoP given the available channel rate. As a result,the video quality varies from GoP to GoP depending on the video content. The problem with this approach is that in some applications, e.g., wireless video, the total bit budget is not sufficient to encode the entire content at an acceptable quality. Video segments with high motion and/or small details may become unacceptable when all GoPs are encoded at the same low rate. Inter-GOP rate control schemes, that is variation of the target bitrate from GoP-to-GoP, have been proposed to offer uniform video quality over the entire video. For example, an optimal solution for the buffer constrained adaptive quantization problem is formulated.The rate-distortion characteristics of the encoded video are used to find the frame rate and quantization parameters that provide the minimum distortion under rate constraints. The minimization operation is done in an iterative manner so thatthe measured distortion is smaller than the previous iteration at each step. However, these methods do not consider the semantics of the video content either in GoP definition or in GoP target bitrate allocation.
As the available computing power at the encoders increases, so does the level of sophisti-cation of the encoders and their associated control techniques. By using appropriate content analysis, it is now possible to define GoPs according to shot boundaries, and allocate target bit rates to each GoP based on the shot type considering the “relevance” or “semantics” of each type of shot. Such a rate control scheme will be called “content-adaptive rate control.” In content-adaptive rate control, video will be encoded according to a pre-specified or user defined relevance-distortion policy. In effect, we accept a priori that some losses are goingto occur due to the high compression ratios needed, and we force these losses to occur in less relevant parts of the video content. We note that “relevance of the content” is highly context (domain) dependent. For example, in the context of a soccer game, the temporal video segments showing a goal event and the spatial segments around the ball are definitely more important than any other part of the video. There are a variety of other domains,such as other sports videos and broadcast news, where the relevance of the content can easily be classified. In content adaptive video coding, temporal segmentation policy used has a major effect on the overall efficiency and rate distribution among temporal segments. There exist techniques for automatically locating such content. A summarization of the available multimedia access technologies that support Universal Multimedia Access (UMA) is presented. Segmentation and summarization of audio-video content are discussed in detail and the transcoding techniques for such content are demon-strated.
Content adaptive rate allocation ideas have been introduced in the literature before.The input video is segmented and encoded as two streams for different relevance levels with “predetermined bit rates,” namely, the high target bitrate (highly relevant) and the low target bitrate (less relevant) streams. The less relevant shots are then encoded such that they are shown as still images at the receiving side and the more important shots are encoded at full quality. In this pioneering work, the decision to restrict the number ofthe relevance levels to two and the determination of the relative bit allocations are donein an ad-hoc manner. Quality of Service (QoS) is required for continuous playback to be guaranteed and low and high rates are determined by the client buffer size and the channel bandwidth. The server buffer size required is set afterwards, which effectively determines the pre-roll delay.
There are also techniques that divide the input video into segments by considering vari-ous statistics along these segments that affect the ease of coding without taking into accountany relevance issues. For example, MPEG-7 metadata are used for video transcoding for home networks. Concepts like “difficulty hints” and “motion hints” are described. Difficulty hints are a kind of metadata that denotes the encoding difficulty of the given content.The motion hints describe the motion un-compensability metadata, which contains infor-mation about the GOP structure, frame rate and bitrate control and also the search range metadata that reduces the complexity of the transcoding process. In this work, boundaries of the temporal segments of the content are determined by the points where the motionun-compensability metadata makes a peak and then the video is transcoded using the difficulty hints. Here, GOP size is varied according to the motion un-compensability metadata. A hybrid scaling algorithm using a quality metric based on the features of the human visual system is introduced in, which tries to make full utilization of the communication channel by scaling video in either temporal or spatial dimensions. In this work, frame rate ofthe encoded video is reduced at scenes where motion jitter is low (high temporal resolution)and all the frames are kept for scenes with high motion at the expense of reduced spatial resolution.