Originally published at: https://bitmovin.com/vvc-open-gop-resolution-switching/
At IBC 2023 the Fraunhofer HHI, Spin Digital and Bitmovin are presenting a paper on the practical application of a new feature that was introduced in VVC: Open-GOP resolution switching. In this blog post I want to explain what open-GOP prediction is, what the benefits are and why with VVC open-GOP prediction can finally be used in adaptive streaming.
Closed-GOP prediction structure
Let us first look at a conventional closed-GOP prediction structure. Since nothing has been decoded yet, the first frame in a bitstream is an Instantaneous Decoding Refresh (IDR) frame. When an IDR frame is received, the decoder is instantaneously reset (refreshed) and all frame buffers and other internal buffers are cleared. Since the frame has no dependencies on other frames, it can always be decoded. An IDR frame is also a Random Access (RA) point, or keyframe: a point at which decoding can be started. (RA points are marked in orange.)
The following frames are then encoded using predictive (P) coding. This means that they use data from the already decoded frames. This includes pixel data for motion compensation but also motion vectors or prediction modes. But let’s illustrate this with an example:
In this example we are encoding a total of 10 frames. The frames are numbered 0-9 in the order in which they are displayed to the viewer. (The vertical offset of the odd-numbered frames is just for illustration purposes.) However, they are not encoded in the order in which they are displayed. In this example, frame 2 uses only frame 0, which has been decoded already, for prediction. Next, frame 1 is decoded, which is displayed between frames 0 and 2 and uses both frames for prediction. This so-called bi-prediction is much more efficient than prediction only from frames in the temporal past and is a key feature that makes modern video codecs so efficient.
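The difference between display order and decoding order can be sketched in a few lines of Python. This is a minimal illustration, not how a real encoder schedules frames: the `references` table and the `decoding_order` helper are hypothetical names, and only the dependencies of frames 0-2 described above are modelled.

```python
# Display index -> display indices of the frames it predicts from.
# Only the dependencies of frames 0-2 are taken from the text above.
references = {
    0: [],      # IDR frame: no references
    1: [0, 2],  # bi-predicted from a past frame (0) and a future frame (2)
    2: [0],     # predicted from frame 0 only
}

def decoding_order(references):
    """Return an order in which every frame follows all of its references."""
    order = []
    remaining = dict(references)
    while remaining:
        # A frame is ready once all of its references have been decoded.
        ready = [f for f, refs in remaining.items()
                 if all(r in order for r in refs)]
        if not ready:
            raise ValueError("circular or missing reference")
        nxt = min(ready)  # tie-break: lowest display index first
        order.append(nxt)
        del remaining[nxt]
    return order

print(decoding_order(references))  # [0, 2, 1]
```

Frame 1 is displayed before frame 2 but can only be decoded after it, because it needs frame 2 as a reference.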
Of course, it is impractical to have only one keyframe at the very beginning of a video. We also want to be able to start decoding at frequent points within a bitstream. This allows us to seek in a video as well as to switch between different renditions as it is done in adaptive streaming. So, we can just insert multiple IDR frames in a video:
In this example, frame 4 is also an IDR frame. We can start decoding at frame 0 as well as at frame 4. Frames 0-3 form a Group of Pictures (GOP) which is completely self-contained and can be decoded completely independently of any other GOPs. The same is true for the following GOP of frames 4-9. As there are no dependencies between these GOPs, this is also referred to as a closed-GOP configuration.
The closed-GOP configuration is widely used in adaptive bitrate streaming applications where the ubiquitous approach is to split the video into segments of a certain length. Each segment is then encoded using a predetermined set of different resolutions and bitrates called renditions. Since every segment starts with an IDR frame, it is possible to start decoding at each segment which therefore enables seeking. Furthermore, the video player can also freely switch to any of the other renditions at every segment boundary.
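As a rough sketch of what switching at segment boundaries means for a player, the snippet below picks a rendition independently for every segment. The rendition ladder, the bitrates and the function names are all made up for illustration, and real players use far more elaborate adaptation logic.

```python
# Hypothetical rendition ladder: (name, bitrate in kbit/s). Illustrative values only.
RENDITIONS = [("360p", 800), ("720p", 2500), ("1080p", 5000)]

def pick_rendition(bandwidth_kbps, renditions=RENDITIONS):
    """Pick the highest-bitrate rendition that fits the measured bandwidth."""
    fitting = [r for r in renditions if r[1] <= bandwidth_kbps]
    if fitting:
        return max(fitting, key=lambda r: r[1])
    # Nothing fits: fall back to the lowest-bitrate rendition.
    return min(renditions, key=lambda r: r[1])

def plan_playback(bandwidth_per_segment):
    """Choose a rendition for each segment on its own.

    This is only valid because every closed-GOP segment starts with an
    IDR frame and therefore needs nothing from the previous segment.
    """
    return [pick_rendition(bw)[0] for bw in bandwidth_per_segment]

print(plan_playback([1000, 3000, 6000, 700]))
# ['360p', '720p', '1080p', '360p']
```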
Another benefit emerges at the encoder side where each long video is split into small pieces (segments). If these segments can be independently decoded, then they can also be independently encoded. And if we mention “many segments” and “independently encodable” then the next thought is “scalability” and “cloud compute”. And this is exactly the principle that the Bitmovin cloud encoder is based on. We take all these individual encoding tasks and then scale horizontally in the cloud.
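The idea of independent encoding tasks can be sketched as follows. This is not the Bitmovin implementation: `encode_segment` is a hypothetical stand-in for a real encoder call, and a local thread pool takes the place of cloud workers.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_segments(total_frames, segment_length):
    """Return (first_frame, last_frame) ranges, one per segment."""
    return [(start, min(start + segment_length, total_frames) - 1)
            for start in range(0, total_frames, segment_length)]

def encode_segment(frame_range):
    """Hypothetical stand-in for encoding one closed-GOP segment."""
    first, last = frame_range
    return f"segment_{first:05d}_{last:05d}.bin"

def encode_all(total_frames, segment_length, workers=4):
    """Encode all segments in parallel; no task depends on any other."""
    segments = split_into_segments(total_frames, segment_length)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map returns results in the order of the input segments.
        return list(pool.map(encode_segment, segments))

print(split_into_segments(10, 4))  # [(0, 3), (4, 7), (8, 9)]
```

Because no segment needs data from another, the tasks can be handed out in any order and to any number of workers.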
Open-GOP prediction structure
So the opposite of a closed-GOP is an open-GOP configuration. The key difference is that in an open-GOP prediction structure predictions between the GOPs are allowed. Let’s again look at an example:
So, the frames 0, 1 and 2 are decoded in a hierarchical fashion as before. But then something different happens. The next frame in decoding order is frame 4 which is a random access point (RA). However, it is not an IDR but a Clean Random Access (CRA) point. While a CRA can also be decoded independently of any other frame, it does not reset the decoder as an IDR does and the reference picture buffer is not cleared. Next, we have frame 3 in coding order. As before, this frame uses frames 2 and 4 as reference. The rest of the frames are coded as before.
As in the closed-GOP example, we can start the decoding process at frame 4 because it does not use any other frames as a reference. But this time, the decoder cannot be reset when this frame is received: the following frame (frame 3) uses previously decoded frames as a reference, so these must remain in the picture buffer. The process of starting decoding from the second GOP is therefore a bit more complex:
Frame 4 is a Clean Random Access (CRA) point so decoding can be started with this frame. For the next frame in coding order (frame 3) we now have an issue. Since one of its references (frame 2) has not been decoded, frame 3 cannot be decoded. If we start decoding at the Random Access (RA) point of frame 4, the decoding of the leading picture 3 must be skipped. Consequently, the frame type is Random Access Skipped Leading (RASL). The remaining frames can be decoded as before.
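The skipping behaviour can be modelled with a small sketch. It assumes the dependencies described above for frames 0-4; the dependencies of frames 5 and 6 are an assumption made to complete the example, and `decode_from` is a hypothetical helper. A real decoder identifies RASL pictures from the signalling in the bitstream rather than by checking references like this.

```python
# Decoding order of the open-GOP example (display indices).
decode_order = [0, 2, 1, 4, 3, 6, 5]

# Display index -> references. Frames 5 and 6 are assumed for illustration.
references = {
    0: [],
    2: [0],
    1: [0, 2],
    4: [],      # CRA: decodable on its own, does not clear the picture buffer
    3: [2, 4],  # leading picture: references frame 2 from the previous GOP
    6: [4],
    5: [4, 6],
}

def decode_from(start, decode_order, references):
    """Simulate decoding from `start`; return (decoded, skipped) frames."""
    decoded, skipped = [], []
    available = set()
    for frame in decode_order[decode_order.index(start):]:
        if all(ref in available for ref in references[frame]):
            available.add(frame)
            decoded.append(frame)
        else:
            # A reference is missing, so the frame has to be skipped.
            skipped.append(frame)
    return decoded, skipped

print(decode_from(4, decode_order, references))  # ([4, 6, 5], [3])
print(decode_from(0, decode_order, references))  # ([0, 2, 1, 4, 3, 6, 5], [])
```

Starting at frame 0, everything decodes. Starting at the CRA, only the leading frame 3 is lost.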
So, what are the advantages of an open-GOP configuration? So far, we have only observed that decoding is more complicated. Moreover, it is no longer possible to switch cleanly to a different rendition at the CRA: the reference frames that the leading frames of the open-GOP need have not been decoded in the new rendition, so those leading frames would have to be skipped. But there are two substantial advantages:
Coding performance
As I already mentioned, bidirectional prediction into the temporal past and future is one of the key features that make modern video codecs so efficient. Generally, the more past and future reference frames a frame can use for prediction, the higher the compression efficiency. Frames that do not use any other frames as reference (RA frames like IDR and CRA frames) typically have the worst compression efficiency.
While we cannot avoid having regular random access points in the bitstream for seeking, the open-GOP configuration lets us code the leading pictures much more efficiently. This leads to a significant reduction in overall bitrate at the same quality. In experiments by Fraunhofer HHI, an overall BD-rate reduction of up to 9% was observed. Of course, these results depend on many factors like the general coding structure, the resolution and bitrate as well as the content itself.
Coding performance across segment boundaries
In a closed-GOP configuration, the decoder must be reset with every IDR frame. An unwanted side effect of this is that the quality as well as the visual representation of a scene changes very abruptly at this point. Especially at lower bitrates, this can be perceived as a sudden jump or pumping in the video. Things that are generally hard to encode like water, clouds and trees are particularly susceptible to this effect.
In this example, the difference between closed-GOP on the left and open-GOP on the right is nicely visible. The pumping is especially notable in the exhaust clouds of the rocket launch in the first scene and in the trees in the background in the second scene. In the open-GOP configuration this effect is hardly visible.
Open-GOP resolution switching
I mentioned before that in a closed-GOP configuration, switching to a different rendition is possible at every IDR frame. We also saw that if we start decoding at a CRA frame that has associated RASL frames, we must skip decoding of these leading frames. Obviously, we don’t want to skip decoding frames whenever the player switches to a different resolution. This would be a horrible experience for the viewer.
Fortunately, VVC has a trick up its sleeve for exactly this scenario. In the example above we noted that decoding of the RASL frame (frame 3) is not possible because it uses frame 2 as a reference, which has not been decoded when switching renditions. But what has been decoded is a different version of frame 2 from a different rendition. While this frame may have been decoded at a different quality or even at a different spatial resolution, it is a representation of the exact same frame. So, with a bit of high-level syntax, the VVC decoder can use this frame from another rendition as a reference frame for decoding frame 3. Even if the frame uses a different resolution, the decoder has a standardized set of up/down scaling filters. Let’s look at this:
In this example we are decoding frames 0 to 2 from a rendition at a lower resolution. Then the player decides to switch to a rendition with a higher resolution and bitrate. Decoding of the CRA (frame 4) is no problem since RA frames can be decoded independently of other frames. For frame 3, the decoder will now upscale frame 2 from the lower rendition and use this frame as a reference instead of the unavailable frame from the higher rendition. Decoding of the remaining frames is unchanged.
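The reference substitution can be sketched as a lookup in the decoded picture buffer that resamples on demand. Everything here is a simplified model: the `Picture` class, the resolutions and the function names are hypothetical, and `resample` only rescales metadata, whereas a VVC decoder applies its standardized scaling filters to the actual samples.

```python
from dataclasses import dataclass

@dataclass
class Picture:
    index: int              # display index of the frame
    width: int
    height: int
    rendition: str
    resampled: bool = False

def resample(pic, width, height):
    """Placeholder for the codec's scaling filters: only metadata changes."""
    return Picture(pic.index, width, height, pic.rendition, resampled=True)

def fetch_reference(buffer, index, width, height):
    """Return the reference at the target size, resampling if it differs."""
    pic = buffer[index]
    if (pic.width, pic.height) != (width, height):
        return resample(pic, width, height)
    return pic

# Frames 0-2 were decoded from a hypothetical 960x540 rendition.
buffer = {i: Picture(i, 960, 540, "low") for i in (0, 1, 2)}
# After the switch, CRA frame 4 is decoded from a hypothetical 1920x1080 rendition.
buffer[4] = Picture(4, 1920, 1080, "high")

# Frame 3 of the high rendition references frames 2 and 4.
refs = [fetch_reference(buffer, i, 1920, 1080) for i in (2, 4)]
for ref in refs:
    print(ref.index, ref.rendition, ref.width, ref.height, ref.resampled)
```

Frame 2 comes back as an upscaled copy of the low rendition's picture, while frame 4 is used as it is. The buffer itself still holds the original 960x540 version of frame 2.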
As mentioned before, the open-GOP prediction structure significantly reduces quality pumping effects. But there is another advantage. When switching to a higher or lower rendition in a closed-GOP configuration, there is a visible jump in the quality of the video. Of course, the bigger the difference between the two renditions, the more pronounced the visible quality jump becomes. In open-GOP resolution switching, however, the intermediate leading frames that use references from both renditions act as a sort of “quality interpolation”, which results in a much smoother transition between the renditions.
IBC 2023
At the IBC, we are presenting a technical paper about practical implementations and considerations when implementing open-GOP resolution switching with VVC in real world environments. This is a joint effort of Fraunhofer HHI, Spin Digital and Bitmovin. Please join us at the IBC in the “Advances in video coding and processing” session on Sep 16th starting at 14:15 in room E102. Here we will present what technical challenges arise when deploying this feature for low latency live transcoding as well as in the highly scalable Bitmovin cloud encoder.