In my last post I talked about the history of multi-bitrate streaming and how we got from RTSP and HTTP streaming back to HTTP download as the primary media web distribution mechanism. In this post we’ll take a closer look at how adaptive streaming differs from traditional streaming (i.e. RTSP) and plain progressive download.
Let’s start by first taking a look at RTSP as an example of a traditional streaming protocol. RTSP is defined as a stateful protocol. This means that from the first time a client connects to the streaming server until the time it disconnects from the streaming server, the server keeps track of a client’s state. The client communicates its state to the server by issuing it commands such as PLAY, PAUSE or TEARDOWN (the first two are obvious; the last one is used to disconnect from the server and close the streaming session).
Once a session between the client and the server has been established, the server begins sending down the media as a steady stream of small packets (the format of these packets is known as RTP). The size of a typical RTP packet is 1452 bytes, which means that in a video stream encoded at 1 Mbps each packet only carries roughly about 11 msec of video. In RTSP the packets can be trasmitted either over UDP or TCP transports – the latter is preferred in cases where firewalls or proxies block UDP packets, but can also lead to increased latency (TCP packets get re-sent until received).
An HTTP protocol, on the other hand, is known as a stateless protocol. If an HTTP client requests some data, the server will respond by sending down the data, but it won’t remember the client or its state. Every HTTP request is handled as a completely standalone one-time session.
Windows Media Services supports streaming over both RTSP and HTTP. Now, you may ask yourself, “But if HTTP is a stateless protocol, how can it be used for streaming?” WMS uses a modified version of HTTP known as MS-WMSP, which uses standard HTTP for transfer of data and messages but also maintains session states, thus effectively turning it into a streaming protocol like RTSP/TCP. Windows Media Services has also supported RTSP streaming since 2003 (9 Series), over both UDP and TCP. Its implementation of the protocol is publicly documented as MS-RTSP.
The important things to remember about traditional streaming protocols like RTSP and WMS-HTTP is that:
- The server sends the data packets to the client at a real-time rate only - that is the bit rate at which the media is encoded (i.e. a 500 kbps encoded video is streamed to the client at approx. 500 kbps)
- The server only sends ahead enough data packets to fill the client buffer. The client buffer is typically between 1 and 10 seconds (WMP and Silverlight default buffer length is 5 seconds). This means that if you pause a streamed video and wait 10 minutes – still only ~5 seconds of video will have downloaded to your client in that time.
The other most common form of media delivery on the Web today is progressive download. Progressive download is nothing more than just a plain, ordinary file download from an HTTP web server like IIS or Apache. It is supported by Silverlight, Flash, WMP, and nearly every other media player and platform under the sun. The term “progressive” likely stems from the fact that most player clients allow the media file to be played back while the download is still in progress - before the entire file has been fully written to disk (typically to the browser cache). Clients that support the HTTP 1.1 specification can also seek to positions in the media file that haven’t been downloaded yet by performing byte range requests to the web server (assuming it also supports HTTP 1.1).
The most popular video sharing websites on the Web today almost exclusively use progressive download: YouTube, Vimeo, MySpace, MSN Soapbox – and even the rather misnamed Silverlight Streaming Service. (See my previous blog posts for a list of reasons why HTTP download is becoming increasingly popular in the online delivery of media.)
Unlike streaming servers which rarely send more than 10 seconds of media data to the client at a time, HTTP web servers keep the data flowing until the download is complete. This leads to the user experience that we have by now grown very accustomed to thanks to YouTube – if you pause a YouTube video at the beginning of playback and wait, eventually the entire video will have downloaded to your browser cache, allowing you to smoothly play the whole video without any hiccups. There is a downside to this behavior as well – if 30 seconds into a fully downloaded 10 minute video you decide you don’t like it and quit the video, both you and your content provider have just wasted 9:30 minutes worth of bandwidth. In an effort to mitigate this problem, IIS7 Media Pack 1.0 provides a cool feature called Bit Rate Throttling which allows content providers to throttle the download bitrate in order to reduce costs. But that’s another story…
Speaking of misnomers… Here’s another one: Adaptive Streaming. Guess what? It’s not really streaming in the classic sense at all.
Adaptive streaming is really a hybrid delivery method. It acts like streaming but is in fact based on HTTP progressive download. A more technically accurate name for adaptive streaming might be “A Series of Progressive Downloads of Variable Sized Video Fragments,” but even the most determined marketing experts would have a hard time selling that one. :)
A very important thing to remember about adaptive streaming is that there doesn’t really exist a standard implementation for it today because it’s just an advanced download concept, rather than a new protocol. This is why we talk about both Microsoft Smooth Streaming and Move Networks Adaptive Stream as examples of adaptive streaming, even though they use mutually incompatible codecs, formats and encryption schemes. They both rely on HTTP as the transport protocol and perform the media download as a long series of very small progressive downloads, rather than one big progressive download.
In a prototypical adaptive streaming implementation, the video/audio source is cut up into many short segments (“chunks”) and encoded to the desired delivery format. Chunks are typically 2-4 seconds in length. On the video codec level this typically means that each chunk is cut along video GOP boundaries (each chunk starts with a key frame) and has no dependencies on past or future chunks/GOPs. This allows every chunk to later be decoded completely independently from other chunks.
The encoded chunks are then hosted on a regular HTTP web server. A client requests the chunks from the web server in a linear fashion and downloads them using plain HTTP progressive download. As the chunks are downloaded to the client, the client plays back the sequence of chunks in linear order. Because the chunks were carefully encoded without any gaps or overlaps between them, the chunks play back as a seamless video.
The “adaptive” part of the solution comes into play when the video/audio source is encoded at multiple bitrates, generating multiple chunks of various sizes for each 2-4 seconds of video. The client now has the option to choose between chunks of different sizes. Because web servers usually deliver data as fast as network bandwidth allows them, the client can easily estimate user bandwidth and decide to download bigger or smaller chunks ahead of time. The size of the playback/download buffer is fully customizable.
Adaptive streaming, like other forms of HTTP delivery, offers the following advantages to the content provider:
- It’s cheaper to deploy because adaptive streaming can utilize any generic HTTP caches/proxies and doesn’t require specialized servers at every node
- It offers better scalability and reach, reducing “last mile” issues because it can dynamically adapt to inferior network conditions as it gets closer to the user’s home
- It lets the audience adapt to the content, rather than requiring the content providers to guess which bitrates are most likely to be accessible to their audience
It also offers the following benefits for the end user:
- Fast start-up and seek times because start-up/seeking can be initiated on the lowest bitrate before moving up to a higher bitrate
- No buffering, no disconnects, no playback stutter (as long as the user meets the minimum bitrate requirement)
- Seamless bitrate switching based on network conditions and CPU capabilities
- A generally consistent, smooth playback experience
Microsoft Adaptive Streaming Prototype: NBC Olympics 2008
We first prototyped an implementation of HTTP-based adaptive streaming as part of the NBC Olympics 2008 project. In order to deliver the desired level of quality in a short period of time, we took the most basic adaptive streaming implementation approach. We had NBC’s Digital Rapids and Anystream encoders produce multiple WMV files of different bitrates/resolutions for each source. The encoders didn’t employ any new encoding tricks but merely followed strict encoding guidelines (closed GOP, fixed length GOP, VC-1 entry point headers) which ensured exact frame alignment across the various bitrates of the same video. Then we ran the WMV files through a post-processing tool which physically split each WMV file into thousands of 2-second chunks (files). The rest of the solution consisted of simply uploading the chunks to Limelight’s origin web servers (running Apache) and then building a Silverlight player that would download the chunks and play them in sequence. Simple!
The good news: Our implementation worked great for the end users. We were able to offer a better-than-WMS streaming experience while using just simple HTTP download!
The bad news: CDN operators lost many hours (days?) managing millions of tiny files in their systems. Imagine: if every 2-seconds of video is split into a separate file and this is repeated for 5 available bitrates, you end up with 150 files for every minute of video. That’s 13,500 files for a 90-minute soccer game!
So despite NBC Olympics being a huge success for Silverlight and HTTP-based adaptive streaming, it quickly became apparent we had to go back to the drawing board on this one.
At Last! Smooth Streaming!
The IIS Media team soon took charge of turning the NBC Olympics adaptive streaming solution into a real server product. Its official name – IIS7 Smooth Streaming, an extension for Internet Information Services 7.0.
The IIS Media team redesigned the content creation and delivery aspect of the prototype solution in order to fix the file management issues while still keeping all the advantages of the original solution. The new design eschewed the one-file-per-chunk approach in favor of a single contiguous file for each encoded bitrate. The file format of choice: MPEG-4.
Smooth Streaming server uses the MPEG-4 Part 14 (ISO/IEC 14496-12) file format as its disk (storage) and wire (transport) format. Specifically, the Smooth Streaming specification defines each chunk/GOP as an MPEG-4 Movie Fragment (moof) and stores it within a contiguous MP4 file for easy random access. One MP4 file is expected per each bitrate. When the client requests a specific source time segment from the IIS server, the server dynamically finds the appropriate Movie Fragment box within the contiguous MP4 file and sends it over the wire as a standalone file, thus ensuring full cacheability downstream.
In other words, with Smooth Streaming file chunks are created virtually upon client request, but the actual video is stored on disk as a single full-length file per encoded bitrate.
Smooth Streaming Availability
Smooth Streaming server support will ship as part of the next edition of IIS7 Media Pack, a free download for Windows Server 2008. A technology preview of IIS7 Smooth Streaming Server is already available now through Akamai. They are calling this service Akamai AdaptiveEdge Streaming for Microsoft Silverlight. A demo of the service is available at http://www.smoothhd.com.
On the content creation end, creation of on-demand Smooth Streaming-compatible video is already possible with the latest Expression Encoder 2 SP1. Note that you’ll need to purchase the full version of Expression Encoder 2 in order to get Smooth Streaming encoding support – it’s not included in the “Express” trial version. As a helper tool for encoding to multiple-bitrate formats such as Smooth Streaming, I recommend my Smooth Streaming Calculator. (More about Smooth Streaming encoding with Expression Encoder to come soon.)
In addition, we are already working with a number of encoding ISVs on enabling support for the Smooth Streaming format in their professional encoding products.
Smooth Streaming Playback:
You probably already know that Silverlight 2 supports playback of Smooth Streaming sources (if you don’t, go to http://www.smoothhd.com). But how does it do it?
Despite popular belief, Silverlight doesn’t actually feature native support for any particular adaptive streaming technology – Microsoft’s, Netflix’s or Move Networks’ for example. Smooth Streaming support in Silverlight is implemented via the MediaStreamSource API. This API allows developers to implement their own media transport methods (instead of relying on MediaElement‘s native transport methods) while still leveraging Silverlight’s native decoders and renderers. In other words, Silverlight support for Smooth Streaming is provided entirely in .NET code: the parsing of the MPEG-4 file format, the HTTP download, the bitrate switching heuristics, etc. This allows developers to modify and fine-tune the client adaptive streaming code as needed, instead of waiting for the next Silverlight release and hoping it magically fixes every customer scenario.
The most challenging part of Smooth Streaming Silverlight client development is the heuristics module which determines when and how to switch bitrates. Elementary stream switching functionality requires the ability to swiftly adapt to changing network conditions while never falling too far behind, but that’s often not enough to deliver a great experience. One must also consider: What if the user has enough bandwidth but doesn’t have enough CPU power to consume the high bitrates/resolutions? What happens when the video is paused or hidden in the background (i.e. minimized browser window)? What if the resolution of the best available video stream is actually larger than the screen resolution, thus wasting bandwidth? How large should the download buffer window be? How does one ensure seamless rollover to new media assets such as ads? As any web application developer will tell, you there’s much more to building a good player than just setting a source URL for the media element.
Fortunately for those who prefer not to write such code from scratch, there are already two options available for adding Smooth Streaming support to your Silverlight application:
- Expression Encoder 2 SP1 templates
Every Silverlight 2 player template included with Expression Encoder 2 SP1 includes a ready Smooth Streaming module as well as complete source code (which can be modified and used freely). The Smooth Streaming object (named AdaptiveStreaming.dll) can be easily integrated into any Silverlight project. See James Clarke’s blog for additional Expression Encoder tips & tricks.
- Open Video Player (OVP)
The Akamai-led Open Video Player Initiative is an open-source community project that strives to provide a best-of-breed video player platform for Silverlight and Flash. The Silverlight version of the Open Video Player provides integrated support for Smooth Streaming playback, and is in fact the video player used by Akamai on SmoothHD.com and many of their customer sites.
These templates provide great out-of-the-box Smooth Streaming experiences while also allowing developers to continue innovating and fine-tuning the client code.
In my next blog post: The Smooth Streaming Format