Exploring Apple's Stereoscopic Video

Posted: 2024-03-12

Apple's recent introduction of a stereoscopic ("3D") video storage format for visionOS, the operating system of the Apple Vision Pro headset, has sparked a surge of interest. Support for the format has begun to appear in video tools such as encoders and muxers, and we have integrated it into the VTCLab Media Analyzer as well.

In essence, Apple has opted not to reinvent the wheel and has instead chosen the MV-HEVC (Multiview HEVC) standard as its foundation. With MV-HEVC, the frames for both eyes are stored in a single track, as opposed to the separate left-eye and right-eye tracks used in traditional stereo encoding. This improves compression efficiency, because the images for the left and right eyes are usually very similar and the second view can be predicted from the first. It also simplifies frame synchronization.
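To illustrate how two views can share one track: every HEVC NAL unit starts with a 2-byte header that carries a 6-bit nuh_layer_id, and in a multiview stream this is what tells the base view (layer 0) apart from the additional view (layer 1). The sketch below decodes that header and lists the layer IDs found in one sample. The function names are ours, and the 4-byte length prefix in front of each NAL unit is an assumption (the real size is signaled in the 'hvcC' box).

```python
def parse_hevc_nal_header(header: bytes) -> dict:
    """Decode the 2-byte HEVC NAL unit header.

    Bit layout (MSB first): forbidden_zero_bit(1), nal_unit_type(6),
    nuh_layer_id(6), nuh_temporal_id_plus1(3).
    """
    if len(header) < 2:
        raise ValueError("an HEVC NAL unit header is 2 bytes long")
    value = (header[0] << 8) | header[1]
    return {
        "forbidden_zero_bit": value >> 15,
        "nal_unit_type": (value >> 9) & 0x3F,
        "nuh_layer_id": (value >> 3) & 0x3F,
        "temporal_id": (value & 0x07) - 1,
    }


def layer_ids_in_sample(sample: bytes, length_size: int = 4) -> list:
    """Return the nuh_layer_id of each length-prefixed NAL unit in a sample.

    length_size=4 is an assumption for this sketch, not read from the file.
    """
    ids = []
    pos = 0
    while pos + length_size <= len(sample):
        nal_len = int.from_bytes(sample[pos:pos + length_size], "big")
        pos += length_size
        if nal_len < 2 or pos + nal_len > len(sample):
            raise ValueError("truncated or malformed NAL unit")
        ids.append(parse_hevc_nal_header(sample[pos:pos + 2])["nuh_layer_id"])
        pos += nal_len
    return ids


# Synthetic sample: one slice NAL unit for layer 0 followed by one for layer 1.
# The payload bytes (0xAA, 0xBB) are placeholders, not real slice data.
demo_sample = (
    (3).to_bytes(4, "big") + bytes([0x02, 0x01, 0xAA])
    + (3).to_bytes(4, "big") + bytes([0x02, 0x09, 0xBB])
)
print(layer_ids_in_sample(demo_sample))
```

A player or analyzer that only understands plain HEVC can skip every NAL unit whose nuh_layer_id is not 0 and still decode the base view.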

Building upon MV-HEVC, Apple has added several descriptors that make it easier to identify and define stereoscopic video settings without delving into intricate MV-HEVC stream parsing. While these descriptors can be used with codecs other than MV-HEVC, we'll focus on MV-HEVC for now.

Now, let's delve into the key similarities and differences between 3D video and standard video streams.

  • The container used is the conventional QuickTime / ISOBMFF.

  • The main HEVC headers (VPS / SPS / PPS) of the base layer are still housed in the 'hvcC' box (refer to the image below, highlighted in blue).

  • The main HEVC headers (VPS / SPS / PPS) of the additional layer (for the second eye) are located in the 'lhvC' box (refer to the image below, highlighted in green).

Apple 3d video screenshot 1
  • Frames of layer 0 and layer 1 are interleaved. A frame from layer 1 always follows its corresponding frame from layer 0. Below, frames for layer 0 (left eye by default) are marked with blue arrows and frames for layer 1 (right eye by default) are marked with green arrows.
Apple 3d video screenshot 2
  • There is a 'vexu' box ('VideoExtendedUsageBox'), containing an 'eyes' box ('StereoViewBox'), which in turn houses a 'stri' box ('StereoViewInformationBox').

  • 'vexu' and 'eyes' serve as simple containers, acting as wrappers for their child boxes.

  • The 'stri' box contains the actual data, with the most notable fields being:

    • has_right_eye_view
    • has_left_eye_view (self-explanatory)
    • eye_views_reversed: By default, video for the left eye precedes that for the right eye. A value of 1 indicates reverse order.
    • has_additional_views: This flag is set if there are other views besides the left and right eyes (e.g., a central view).
Apple 3d video screenshot 3
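
The box nesting described above can be walked with a few lines of code. The following is a minimal sketch, not a full parser. It assumes that 'vexu' and 'eyes' are plain container boxes, that 'stri' is a FullBox (1 byte of version, 3 bytes of flags) followed by a single flags byte, and that the four flags occupy the low four bits of that byte in the order shown in the code. That bit order is our reading of Apple's specification and should be checked against it. All function names are ours. The sketch starts from the bytes of the 'vexu' box itself; locating that box inside a real file's sample description is not shown.

```python
import struct


def iter_boxes(data: bytes, start: int = 0, end=None):
    """Yield (type, payload_start, payload_end) for each box in data[start:end]."""
    end = len(data) if end is None else end
    pos = start
    while pos + 8 <= end:
        size, box_type = struct.unpack_from(">I4s", data, pos)
        header = 8
        if size == 1:  # a 64-bit size follows the type field
            if pos + 16 > end:
                raise ValueError("truncated box header at offset %d" % pos)
            size = struct.unpack_from(">Q", data, pos + 8)[0]
            header = 16
        elif size == 0:  # box runs to the end of the enclosing container
            size = end - pos
        if size < header or pos + size > end:
            raise ValueError("malformed box at offset %d" % pos)
        yield box_type.decode("latin-1"), pos + header, pos + size
        pos += size


def find_box(data: bytes, path, start: int = 0, end=None):
    """Follow a path such as ["vexu", "eyes", "stri"]; return the payload or None."""
    for wanted in path:
        for box_type, p_start, p_end in iter_boxes(data, start, end):
            if box_type == wanted:
                start, end = p_start, p_end
                break
        else:
            return None
    return data[start:end]


def parse_stri(payload: bytes) -> dict:
    """Decode the 'stri' flags. The bit positions are an assumption (see above)."""
    if len(payload) < 5:
        raise ValueError("'stri' payload is too short")
    bits = payload[4]  # skip the 4 FullBox bytes: version + flags
    return {
        "eye_views_reversed": bool(bits & 0x08),
        "has_additional_views": bool(bits & 0x04),
        "has_right_eye_view": bool(bits & 0x02),
        "has_left_eye_view": bool(bits & 0x01),
    }


def make_box(box_type: str, payload: bytes) -> bytes:
    """Build a box with a 32-bit size; used only to create demo input."""
    return struct.pack(">I4s", 8 + len(payload), box_type.encode("latin-1")) + payload


# Synthetic 'vexu' box: both eye views present, default order, no extra views.
demo_vexu = make_box("vexu", make_box("eyes", make_box("stri", bytes([0, 0, 0, 0, 0x03]))))
stri_payload = find_box(demo_vexu, ["vexu", "eyes", "stri"])
print(parse_stri(stri_payload))
```

Because find_box returns None when any box on the path is missing, the same helper can be used to probe for optional children of 'eyes' without special-casing them.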

According to the specification, the 'eyes' box may also contain other boxes, such as the 'hero' box ("HeroStereoEyeDescriptionBox"), which indicates the designated "hero" eye in stereo vision. When signaled, it suggests that the other eye's view is derived from the hero eye's view, which can be helpful in monoscopic viewing settings.

For further details, refer to the specification here: https://developer.apple.com/av-foundation/Stereo-Video-ISOBMFF-Extensions.pdf.

With the foundation set by Apple and the broader industry's ongoing efforts, the future of stereoscopic video looks promising, offering exciting possibilities for immersive storytelling and interactive media experiences. Stay tuned for more updates and insights as we navigate the ever-evolving landscape of digital media and technology.