Which fragmented MP4 format would you use, and why?

We all know and love MP4 files. Whether referred to as ISOBMFF, CMAF, or one of the other closely related terms that essentially describe variations of the same container format, they are at the core of any modern streaming solution, and in particular of adaptive streaming (ABR), which relies on fragmenting the container into small, easily downloadable “segments” of content that the player can progressively fetch and render.

But when it comes to fragmenting, there are two main flavours out there (and tons of confusing terminology, in particular the use of “fragmented MP4” (fMP4) to refer to both):

  1. Multiple MP4 files, with one file per segment (or “hard-parted” as a customer recently called it - not accepted terminology, but I kinda like it).
  2. A single MP4 file, internally fragmented (or “soft-parted” by contrast). In this case, delivery of segments relies on the origin service supporting byte-range requests to retrieve the relevant fragments (see the sketch right after this list).
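
To make the soft-parted case concrete, here is a minimal sketch of such a byte-range request in Python. The URL and byte offsets are made up purely for illustration; in a real workflow the player derives the exact offsets of each fragment from the segment index (the `sidx` box) or from the manifest (e.g. `SegmentBase`/`indexRange` in a DASH MPD).

```python
import requests

# Hypothetical URL and byte offsets, for illustration only.
url = "https://example-origin.com/video/stream.mp4"
fragment_range = "bytes=1024-524287"  # assumed offsets of one moof+mdat pair

response = requests.get(url, headers={"Range": fragment_range})

# An origin that supports byte-range requests answers with 206 Partial Content
# and returns only the requested slice of the file.
print(response.status_code)    # expected: 206
print(len(response.content))   # size of the returned fragment slice
```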

At Bitmovin, we support both containers (or “muxings” as we call these things):

  1. fMP4 muxing for hard-parted outputs
  2. MP4 muxing with fragmentDuration set to a non-null value, for soft-parted outputs (a rough configuration sketch of both follows below).
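
Here is a rough sketch of the two muxing configurations as plain REST calls. The encoding, stream and output IDs are placeholders, and the field names (segmentLength, segmentNaming, fragmentDuration) follow the public API schema as I understand it; double-check the API reference, or use one of the SDKs, before building production code on this.

```python
import requests

API_KEY = "..."        # your Bitmovin API key
ENCODING_ID = "..."    # an existing encoding
STREAM_ID = "..."      # the stream to be muxed
OUTPUTS = [{"outputId": "...", "outputPath": "path/to/output/"}]  # placeholder output

BASE = f"https://api.bitmovin.com/v1/encoding/encodings/{ENCODING_ID}/muxings"
HEADERS = {"X-Api-Key": API_KEY, "Content-Type": "application/json"}

# 1. fMP4 muxing -> hard-parted: one small file per segment
fmp4_muxing = {
    "segmentLength": 4,                       # segment duration in seconds
    "segmentNaming": "segment_%number%.m4s",
    "streams": [{"streamId": STREAM_ID}],
    "outputs": OUTPUTS,
}
requests.post(f"{BASE}/fmp4", json=fmp4_muxing, headers=HEADERS)

# 2. MP4 muxing with fragmentDuration -> soft-parted: one file, internally fragmented
mp4_muxing = {
    "filename": "video.mp4",
    "fragmentDuration": 4000,                 # fragment duration in milliseconds
    "streams": [{"streamId": STREAM_ID}],
    "outputs": OUTPUTS,
}
requests.post(f"{BASE}/mp4", json=mp4_muxing, headers=HEADERS)
```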

There is a very small speed advantage in using the segmented / hard-parted approach due to the way our encoder works (parallelised encoding), but otherwise both offer the same set of features (e.g. DRM).

So, from the perspective of encoding and playback, both formats are essentially functionally equivalent (with our products at least), but the rest of the workflow may dictate that you choose one over the other.

Why would you use one over the other? Our community would love to hear about your experience on this!

  • I would use multiple MP4 files (“hard-parted”)
  • I would use a single MP4 file (“soft-parted”)


I want to add here that if you use MP4 muxing for soft-parted outputs, that is, with fragmentDuration set to a non-null value, the fragmentDuration will always take precedence over the GOP settings if there is a conflict.

Example:

You encode with a target frame rate of 25 fps and set fragmentDuration to 4000 ms, i.e. 4 seconds. This means each fragment will contain 100 frames.

If you also configure the encoding with a strict GOP size of 48 frames (minGop=48, maxGop=48, closedGop=false), we have a problem because no integer number of 48-frame GOPs fits into a fragment of 100 frames.

The encoder resolves the conflict by letting the fragment duration take precedence over the GOP size: it will create two 48-frame GOPs and one 4-frame GOP for each 100-frame fragment. The encoded output will thus be soft-parted into 4-second fragments, but the configured strict 48-frame GOP structure will not be adhered to, because it does not fit the fragment size.
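
For illustration, here is a small sketch of that arithmetic. It simply mirrors the example above and is not the encoder's actual implementation.

```python
def gop_layout_per_fragment(frame_rate, fragment_duration_ms, gop_size):
    """Show how a fixed GOP size gets broken up when fragmentDuration wins."""
    frames_per_fragment = round(frame_rate * fragment_duration_ms / 1000)
    full_gops, remainder = divmod(frames_per_fragment, gop_size)
    # Full-size GOPs first, then a short GOP with whatever frames are left over.
    layout = [gop_size] * full_gops + ([remainder] if remainder else [])
    return frames_per_fragment, layout


frames, gops = gop_layout_per_fragment(frame_rate=25, fragment_duration_ms=4000, gop_size=48)
print(frames)  # 100 frames per fragment
print(gops)    # [48, 48, 4] -> two 48-frame GOPs plus one 4-frame GOP
```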