Smooth Streaming Architecture
February 10th, 2009. Posted in H.264, Internet Information Services, Silverlight, Smooth Streaming
As described in the previous two posts, Smooth Streaming is Microsoft’s implementation of HTTP-based adaptive streaming, which is a hybrid media delivery method. It acts like streaming, but is in fact based on HTTP progressive download. The HTTP downloads are performed in a series of small chunks, allowing the media to get easily and cheaply cached along the edge of the network, closer to the end users. Providing multiple encoded bitrates of the same media source also allows Silverlight clients to seamlessly and dynamically switch between bitrates depending on network conditions and CPU power. The resulting end user experience is one of reliable, consistent playback without stutter, buffering or “last mile” congestion. In one word: Smooth.
In this post we’ll take a closer look at how Smooth Streaming works: format, server, and client.
Smooth Streaming Format
Smooth Streaming is the first Microsoft media format in over a decade to use a file format other than ASF. It is based on the ISO/IEC 14496-12 ISO Base Media File Format specification, better known as the MP4 file specification. Why MP4 and not ASF? Well, there are several reasons:
- MP4 is a lightweight container format with less overhead than ASF
- MP4 is easier to parse in managed (.NET) code than ASF
- MP4 is based on a widely used standard, making 3rd party adoption and support more straightforward
- MP4 was architected with H.264 video codec support in mind, and we’re counting on H.264 support in Smooth Streaming and Silverlight 3 (ASF can also contain H.264 video, but it’s not as straightforward as with MP4)
- MP4 was designed to natively support payload fragmentation within the file
There are actually 2 parts to the Smooth Streaming format: the wire format, and the disk file format. In Smooth Streaming a video is recorded in full length to the disk as a single file (one file per encoded bitrate), but it’s transferred to the client as a series of small file chunks. The wire format defines the structure of the chunks that get sent by IIS to the client, whereas the file format defines the structure of the contiguous file on disk. Fortunately, the MP4 specification allows MP4 to be internally organized as a series of fragments, which means that in Smooth Streaming the wire format is a direct subset of the file format.
What are these MP4 “fragments” that I speak of? The basic unit of an MP4 file is called a “box.” These boxes can contain both data and metadata. The MP4 specification allows for various ways to organize data and metadata boxes within a file. In most media scenarios it is considered useful to have the metadata written before the data so that a player client application can have more information about the video/audio it’s about to play before it plays it. However, in live streaming scenarios it is often not possible to write the metadata upfront about the whole data stream because it’s simply not fully known yet. Furthermore, less upfront metadata means less overhead, which can lead to shorter startup times. For these reasons the MP4 ISO Base Media File Format specification was designed to allow MP4 boxes to be organized in a fragmented manner, where the file can be written “as you go” as a series of short metadata/data box pairs, rather than one long metadata/data pair. The Smooth Streaming file format heavily leverages this aspect of the MP4 file specification, to the point where at Microsoft we often interchangeably refer to Smooth Streaming files as “Fragmented MP4 files” or “(f)MP4.”
Here is a high-level overview of what a Smooth Streaming file looks like on the inside:
In a nutshell, the file starts with file-level metadata (‘moov’) that describes the file generically, but the bulk of the payload is contained in the fragment boxes, each of which carries more accurate fragment-level metadata (‘moof’) along with the media data (‘mdat’). (The diagram only shows 2 fragments, but a typical Smooth Streaming file has one fragment for every 2 seconds of video/audio.) The file closes with an ‘mfra’ index box, which allows easy and accurate seeking within the file.
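To make the box layout concrete, here is a minimal Python sketch that walks the top-level boxes of a file. It relies only on the generic ISO Base Media File Format box header (a 32-bit big-endian size followed by a four-character type, with size 1 meaning a 64-bit size follows and size 0 meaning "to end of file"). The helper names `iter_top_level_boxes` and `make_box`, and the synthetic byte string standing in for a file, are illustrative assumptions; a real *.ismv file will contain more boxes than this toy example.

```python
import io
import struct


def iter_top_level_boxes(stream):
    """Yield (box_type, offset, size) for each top-level box in the stream."""
    offset = 0
    while True:
        header = stream.read(8)
        if len(header) < 8:
            return  # end of file (or trailing garbage shorter than a header)
        size, box_type = struct.unpack(">I4s", header)
        header_len = 8
        if size == 1:
            # A size of 1 means a 64-bit "largesize" follows the type field.
            size = struct.unpack(">Q", stream.read(8))[0]
            header_len = 16
        elif size == 0:
            # A size of 0 means the box runs to the end of the file.
            end = stream.seek(0, io.SEEK_END)
            size = end - offset
        if size < header_len:
            raise ValueError(f"corrupt box header at offset {offset}")
        yield box_type.decode("ascii"), offset, size
        offset += size
        stream.seek(offset)


def make_box(box_type, payload=b""):
    """Build a box with a plain 32-bit size header (hypothetical test helper)."""
    return struct.pack(">I4s", 8 + len(payload), box_type) + payload


# Synthetic stand-in for a fragmented file: moov, two moof/mdat pairs, mfra.
sample = b"".join([
    make_box(b"moov", b"\x00" * 16),
    make_box(b"moof", b"\x00" * 8),
    make_box(b"mdat", b"\x01" * 32),
    make_box(b"moof", b"\x00" * 8),
    make_box(b"mdat", b"\x02" * 32),
    make_box(b"mfra", b"\x00" * 12),
])

boxes = list(iter_top_level_boxes(io.BytesIO(sample)))
layout = [box_type for box_type, _, _ in boxes]
print(layout)
```

Because each fragment is just a ‘moof’ box followed by its ‘mdat’ box, the offset and size of that pair are all a server needs in order to copy a fragment out of the file unchanged.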
When a Silverlight client requests a video time slice from the IIS Smooth Streaming server, the server simply seeks to the appropriate starting fragment in the MP4 file and then lifts the fragment out of the file and sends it over the wire to the client. This is why we refer to the fragments as the “wire format.” This technique greatly enhances the efficiency of the IIS server because it requires no remuxing or rewriting overhead.
Here is what an MP4 fragment looks like in more detail:
We say that the Smooth Streaming format is based on the MP4 file format because even though we’re following the ISO specification, we specify our own box organization schema and some custom boxes. In order to differentiate Smooth Streaming files from “vanilla” MP4 files, we use new file extensions: *.ismv (video+audio) and *.isma (audio only). I keep forgetting to ask the IIS Media team what the acronyms exactly stand for, but my best guess would be “IIS Smooth Streaming Media Video (Audio)”.
Smooth Streaming Media Assets
A typical Smooth Streaming media asset therefore consists of the following files:
- MP4 files containing video/audio
  - *.ismv – contains video and audio, or only video
    - 1 ISMV file per encoded video bitrate
  - *.isma – contains only audio
    - In videos with audio, the audio track can be muxed into an ISMV file instead of a separate ISMA file
- Server manifest file
  - *.ism
  - Describes the relationships between media tracks, bitrates and files on disk
  - Only used by the IIS Smooth Streaming server – not by client
- Client manifest file
  - *.ismc
  - Describes to the client the available streams, codecs used, bitrates encoded, video resolutions, markers, captions, etc.
  - It’s the first file delivered to the client
Both manifest file formats are based on XML. The server manifest file format is based specifically on the SMIL 2.0 XML format specification.
A folder containing a single Smooth Streaming media asset might look something like this:
In this particular case the audio track is contained in the NBA_3000000.ismv file.
Smooth Streaming Manifest Files
The Smooth Streaming Wire/File Format specification defines the manifest XML language as well as the MP4 box structure. Because the manifests are based on XML they are highly extensible. Among the features already included in the current Smooth Streaming format specification is support for:
- VC-1, WMA, H.264 and AAC codecs
- Text streams
- Multi-language audio tracks
- Alternate video and audio tracks (e.g. multiple camera angles, director’s commentary, etc.)
- Multiple hardware profiles (e.g. same bitrates targeted at different playback devices)
- Script commands, markers/chapters, captions
- Client manifest Gzip compression
- URL obfuscation
- Live encoding and streaming
For an example of a Smooth Streaming On-Demand Server Manifest file, see here.
For an example of a Smooth Streaming Client Manifest file, see here.
Smooth Streaming Playback: Bringing It All Home
Microsoft’s adaptive streaming prototype (used for NBC Olympics 2008) relied on physically chopping up long video files into small file chunks. In order to retrieve the chunks from the web server, the player client simply needed to download files in a logical sequence: 00001.vid, 00002.vid, 00003.vid, etc.
As I’ve explained in this and previous posts, Smooth Streaming uses a more sophisticated file format and server design. The videos are no longer split up into thousands of file chunks, but are instead “virtually” split up into fragments (typically 1 fragment per video GOP) and stored within a single contiguous MP4 file. This implies two significant changes in server and client design too:
- The server must be able to translate URL requests into exact byte range offsets within the MP4 file, and
- The client can request chunks in a more developer-friendly manner, such as by timecode instead of by index number
The first thing a Silverlight client requests from the Smooth Streaming server is the *.ismc client manifest. The manifest tells it which codecs were used to compress the content (so that the Silverlight runtime can initialize the correct decoder and build the playback pipeline), which bitrates and resolutions are available, and a list of all the available chunks and either their start times or durations.
With IIS7 Smooth Streaming, a client is expected to request fragments in the form of RESTful URLs:
http://video.foo.com/NBA.ism/QualityLevels(400000)/Fragments(video=610275114)
http://video.foo.com/NBA.ism/QualityLevels(64000)/Fragments(audio=631931065)
The values passed in the URL represent the encoded bitrate (e.g. 400000) and the fragment start offset (e.g. 610275114), expressed in an agreed-upon time unit (usually 100 ns). The client learns these values from the client manifest.
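As an illustration of the URL scheme, here is a small Python sketch that builds a fragment URL from a bitrate and a start offset, and converts the offset to seconds. The function names are made up for this example, and the 100 ns time unit is assumed as the default since that is the usual case; a manifest that declares a different time unit would need a different `timescale`.

```python
TICKS_PER_SECOND = 10_000_000  # assumes the usual 100 ns time unit


def fragment_url(base_url, bitrate, stream, start_ticks):
    """Build a fragment URL in the QualityLevels(...)/Fragments(...) form."""
    return f"{base_url}/QualityLevels({bitrate})/Fragments({stream}={start_ticks})"


def ticks_to_seconds(ticks, timescale=TICKS_PER_SECOND):
    """Convert a fragment start offset to seconds."""
    return ticks / timescale


url = fragment_url("http://video.foo.com/NBA.ism", 400000, "video", 610275114)
start_seconds = ticks_to_seconds(610275114)
print(url)
print(round(start_seconds, 3))
```

With 100 ns units, the video fragment in the first example URL starts about 61 seconds into the presentation.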
Upon receiving a request like this, the IIS7 Smooth Streaming component looks up the quality level (bitrate) in the corresponding *.ism server manifest and maps it to a physical *.ismv or *.isma file on disk. It then goes and reads the appropriate MP4 file, and based on its ‘tfra’ index box figures out which fragment box (‘moof’ + ‘mdat’) corresponds to the requested start time offset. It then extracts the said fragment box and sends it over the wire to the client as a standalone file. This is a particularly important part of the overall design because the sent fragment/file can now be automatically cached further down the network, potentially saving the origin server from sending the same fragment/file again to another client requesting the same RESTful URL.
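The lookup described above can be sketched roughly as follows. This is not the IIS implementation: the two dictionaries are hypothetical in-memory stand-ins for the *.ism server manifest and the file's fragment index, the file names for the lower bitrates and all byte offsets are invented, and only the URL shape comes from the examples in this post.

```python
import re

# Matches .../<name>.ism/QualityLevels(<bitrate>)/Fragments(<stream>=<start>)
FRAGMENT_RE = re.compile(
    r"/(?P<manifest>[^/]+\.ism)"
    r"/QualityLevels\((?P<bitrate>\d+)\)"
    r"/Fragments\((?P<stream>\w+)=(?P<start>\d+)\)$"
)

# Hypothetical stand-in for the server manifest: (stream, bitrate) -> file.
BITRATE_TO_FILE = {
    ("video", 400000): "NBA_400000.ismv",
    ("audio", 64000): "NBA_3000000.ismv",
}

# Hypothetical stand-in for the fragment index: start time -> (offset, length).
# The byte values are invented for illustration.
FRAGMENT_INDEX = {
    "NBA_400000.ismv": {("video", 610275114): (123456, 7890)},
    "NBA_3000000.ismv": {("audio", 631931065): (222222, 3456)},
}


def parse_fragment_request(path):
    """Return (manifest, bitrate, stream, start) or None if the path doesn't match."""
    match = FRAGMENT_RE.search(path)
    if match is None:
        return None
    return (match["manifest"], int(match["bitrate"]),
            match["stream"], int(match["start"]))


def resolve_fragment(path):
    """Map a request path to (file_name, byte_offset, byte_length), or None."""
    parsed = parse_fragment_request(path)
    if parsed is None:
        return None
    _, bitrate, stream, start = parsed
    file_name = BITRATE_TO_FILE.get((stream, bitrate))
    if file_name is None:
        return None
    entry = FRAGMENT_INDEX[file_name].get((stream, start))
    if entry is None:
        return None  # no fragment starts at exactly this time
    offset, length = entry
    return file_name, offset, length


hit = resolve_fragment("/NBA.ism/QualityLevels(400000)/Fragments(video=610275114)")
miss = resolve_fragment("/NBA.ism/QualityLevels(400000)/Fragments(video=610275115)")
print(hit, miss)
```

Note that this sketch treats the start time as an exact-match key, so a request that is off by even one time unit resolves to nothing. Since every client asking for the same fragment uses the same URL, the response is also a natural cache key for anything sitting between the origin and the client.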
As you can see, requesting chunks of video/audio from the server is easy. But what about dynamic bitrate switching that makes adaptive streaming so effective? This part of the Smooth Streaming experience is implemented entirely in client-side Silverlight application code – the server plays no part in the bitrate switching process. The client-side code looks at chunk download times, buffer fullness, rendered frame rates, and other factors – and based on them decides when to request higher or lower bitrates from the server. Remember, if during the encoding process we ensure that all bitrates of the same source are perfectly frame aligned (same length GOPs, no dropped frames), then switching between bitrates is completely seamless – and Smooth.
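To give a feel for what client-side switching logic can look like, here is a deliberately simplified Python sketch. It is not the Silverlight heuristic: the function name, the thresholds (`low_buffer`, `high_buffer`, `safety`) and the decision rules are all assumptions made for illustration, and it considers only chunk download time and buffer fullness, ignoring rendered frame rate and the other factors mentioned above.

```python
def choose_bitrate(available, current, chunk_seconds, download_seconds,
                   buffer_seconds, low_buffer=4.0, high_buffer=10.0, safety=0.8):
    """Pick the bitrate for the next chunk (toy heuristic, illustrative only)."""
    levels = sorted(available)
    idx = levels.index(current)

    # Estimate throughput in bits/s from the last chunk: bits fetched / time taken.
    throughput = current * chunk_seconds / download_seconds
    budget = throughput * safety  # leave some headroom

    if budget < current or buffer_seconds < low_buffer:
        # Struggling: drop to the highest lower level that fits the budget,
        # or to the lowest level if nothing fits.
        lower = [b for b in levels[:idx] if b <= budget]
        return lower[-1] if lower else levels[0]

    if (buffer_seconds > high_buffer and idx + 1 < len(levels)
            and levels[idx + 1] <= budget):
        # Healthy buffer and spare bandwidth: step up one level.
        return levels[idx + 1]

    return current


levels = [400000, 800000, 1500000, 3000000]

# A 2-second chunk at 1.5 Mbps took 4 seconds to download: switch down.
down = choose_bitrate(levels, 1500000, 2.0, 4.0, buffer_seconds=6.0)
# A 2-second chunk at 800 kbps took 0.5 seconds with a full buffer: switch up.
up = choose_bitrate(levels, 800000, 2.0, 0.5, buffer_seconds=12.0)
# Bandwidth is adequate but the buffer is only moderately full: stay put.
stay = choose_bitrate(levels, 800000, 2.0, 1.0, buffer_seconds=6.0)
print(down, up, stay)
```

Whatever the heuristic decides, the only thing that changes on the wire is the QualityLevels value in the next fragment URL, which is why the server needs no knowledge of the switching logic.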
In my next blog post: Encoding For Smooth Streaming



30 Responses to “Smooth Streaming Architecture”
By Stenka on Feb 24, 2009
Hi Alex and thanks for the informative posts.
I’d like to ask: is it possible to get “near realtime” streaming with Windows Media?
I’ve been doing WM encoding for a while, and the fastest connection seems to be straight to the encoder, where I’ve achieved a delay of only a few seconds.
On the other hand, we have also used Live Meeting, and there the realtime experience is almost there (~1 sec to here from the U.S. west coast, I presume).
Would you please elaborate on how you would go about achieving the fastest connection to the client?
Thanks
By Alex Zambelli on Feb 26, 2009
Hi Stenka,
Latency in live Windows Media streaming can be reduced to 2-4 seconds with the right set up. Ben Waggoner’s blog post – http://on10.net/blogs/benwagg/Low-Latency-webcasting-with-Windows-Media-and-Siverlight/ – explains how to configure the encoder, WMS server and client (WMP or Silverlight) to deliver low latency streams. Note, however, that adding WMS proxies and CDNs into the workflow will add additional latency.
Windows Media wasn’t really designed for real-time communications, so it’s not possible to reduce latency below 2-4 seconds.
Live Meeting was designed for realtime communications. Realtime communications tools such as Office Communicator and Live Meeting are specifically tuned for those scenarios.
By afh3 on Mar 25, 2009
Alex,
I’m not having much luck figuring out how to read the metadata (present during the Expression Encoder2 SP1 session) back out of smooth-streaming files it creates. Using the Expression Encoder 2 SP1 SDK, I can get the metadata back out of pretty much every kind of media-file produced by the encoder, — except the ismv types.
I’ve got a nice little Silverlight 2 video player built, but I need to provide things like the title and description for each playlist item in the initparms as strings that I generate myself – as opposed to being able to pull these very same fields from the ismv files that were created from encoder sessions that already had those values set as metadata in the job.
What I have is a common dump-folder for the output of Expression Encoder 2 (SP1) smooth-streaming files that resides in an IIS7 site configured to stream them to my player app. It all works perfectly as it is, but I would really like to be able to automate the very last piece which would be to have my player’s playlist item’s title and description properties extracted from the ismv files. That way, whatever values the encoding-user put into the metadata on the encoder-job would appear in the gallery panel of my Silverlight player. If I had this, I could build the custom initparms to pass to my Silverlight usercontrol and it would be displayed next to the thumbnails just like it is now – only I wouldn’t have to constantly update an XML file (as I do now) with that information every time a new file gets encoded.
Can this be done?
I know I could do this by creating a front-end for the encoder with the SDK, and then writing those metadata values to my existing XML-file, but I’d rather let the users have the nice Encoder 2 interface rather than something horrible like I’d likely come up with (heh).
Thanks for any insight you may have.
-afh3
By Alex Zambelli on Mar 27, 2009
Hi afh3,
I thought EE2 wrote Smooth Streaming metadata into the client manifest (*.ismc), but I could be wrong. Would you mind sending me an e-mail (my address is listed in the “About” section)? I’ll put you in touch with some folks on the Expression Encoder team and we’ll get this sorted out. Even if it turns out EE2 SP1 doesn’t have the functionality you need, it’ll still be useful to discuss needed features for v3/v4.
Alex
By Tim Weiss on Jun 16, 2009
How would you go about inserting captions into a set of Smooth Streams that have already been created?
By Hrry on Jun 29, 2009
I am trying to do live streaming from Encoder 2, using Server 2008, Windows Media Services 2008 and IIS7. My problem is that I can’t get the video to show on my webpage, which I built with Expression Web 2. Can you show me how to set up a simple webpage and connect all these together? Thank you, please answer ASAP.
By Sushil on Dec 30, 2009
Hi Alex,
Can you point me to some resource on how live ad insertion is done or for that matter how to provide text streams dynamically.
The scenario is I have a pre-encoded smooth streaming content and need to append metadata dynamically. e.g. Announcements etc., without having to encode the content again.
Thanks in advance.
- Sushil
By Duncan Smart on Feb 5, 2010
Saw this over at Scott Hanselman’s blog… any idea why HTTP byte range requests weren’t used? e.g:
GET /blah HTTP/1.1
Range: bytes=0-999
…
HTTP/1.1 206 Partial Content
Content-Length: 1000
Content-Range: bytes 0-999/10000
…
This is what the iPhone/iPod Touch does: no special software required on the server – it’s plain old HTTP 1.1.
By Alex Zambelli on Feb 14, 2010
Hi Duncan,
Byte range requests weren’t used because:
1) they’re not universally cacheable
2) at the time of Silverlight 2 (when Smooth Streaming was being developed) not all browser stacks supported HTTP 1.1 yet
By Alex Zambelli on Feb 14, 2010
Hi Sushil,
I recommend you look at http://www.codeplex.com/Wikipage?ProjectName=smf because the Silverlight Media Framework is already enabled for interpreting ad insert metadata.
If you’d like to insert ad markers into pre-encoded content, you could do it by appending ad stream metadata to the client manifest (*.ismc), like this:
The string within the F tags is just a base64 encoded string formatted in a way that makes sense to your client.
By Sushil on Feb 15, 2010
Hi Alex,
Thanks for your answer.
I got the method you were explaining but in my scenario, if I don’t know what announcements/ads to place in advance and also what position (e.g. I need to put markers for user comments in text stream which are stored in a db) what should I do?
I basically need a way in which everytime user starts a video I should be able to create text-stream dynamically. At the same time I dont want to generate the entire ismc file at runtime.
Please help.
By Alex Zambelli on Feb 16, 2010
Are you still talking about on-demand video, or live video too?
If you don’t know what ads/comments you will need to insert ahead of time, then insert markers which correspond to some lookup scheme (e.g. DoubleClick calls).
By Sushil on Feb 19, 2010
Yes I meant about on-demand content.
My only point was that, since the player is pulling video and audio chunks based on bandwidth, CPU, seek position, etc., it would have helped me to serve metadata based on the client’s capabilities as well. And the way to do that would be to generate the text stream dynamically.
Thanks anyways.
By David on Apr 15, 2010
Hi Alex, I’m trying to use gzip compression with IIS Smooth Streaming. I have configured the compression feature on IIS 7.0 and added the corresponding HTTP response header, but I’m still a little confused about whether I need to prepare compressed files on IIS7. What should I do with the manifest file? The client requests the manifest by the URL “http://host/program.ism/manifest”, but there is actually no file named manifest; there is program.ismc instead. So do I need to compress program.ismc as program.ismc.gzip on IIS7? Or can IIS7 do the compression on demand?
By Alex Zambelli on Apr 16, 2010
GZIP compression in IIS is dynamic – you don’t need to manually zip/unzip anything. The entire process ought to be completely transparent. Just enable Dynamic Compression in IIS7 (you must remember to install the feature – it’s not a default feature) and then use Fiddler2 or HTTPWatch to monitor your Silverlight player’s HTTP traffic. The headers in the outgoing HTTP requests should contain “Accept-Encoding: gzip, deflate” and the headers in the HTTP server responses should contain “Content-Encoding: gzip”. If you don’t see compression applied to video and audio MP4 chunks, don’t worry, gzip compression on compressed video/audio data isn’t very useful anyway since the content is already highly compressed. The true benefit is in client manifest compression, since uncompressed client manifests can often exceed 500 kbytes for long content.
By Mark on May 6, 2010
Hi Alex,
This is very useful documentation that covers the VOD part and, partially, the live part of Smooth Streaming.
Thank you, this is very useful.
The problem for me is that I can’t find documentation for the live part, where the encoder is pushing data to IIS.
When testing the “IIS Smooth Streaming Player Development Kit – Beta 2” I see that the pushencoder is sending manifests for each stream. Is that all there is to it: send the manifest, then stream the data following the specification you posted here? How do I find the place to post a specific stream to? Is it a random string, as in “http://localhost/live/stream.isml/Streams(4996-stream5)”?
This is very vague. There is no documentation for that.
It would be wonderful if you could help me understand how it works.
Kind Regards,
Mark
By gperetz on May 17, 2010
Hi Alex. Maybe you can help me.
I am trying to simulate a live Smooth Streaming client. I’m having problems calculating the start time of fragments that do not appear in the manifest. I was trying to analyze the fragment’s moof box to get the duration and add it to the fragment time. The problem is that the result appears to be incorrect (or maybe should be used in a different manner) for the audio stream. The video stream duration brings me to the next fragment just fine. However, with the audio stream it often misses the time (by a very small difference, sometimes just 1 increment) and so the server returns 404. I don’t see that behaviour with the Silverlight client: it always gets status 200 from the server, and requests the manifest only at the beginning of playback. More details: I am getting a moof box, and inside the traf box I check the tfhd box and the trun box. The tfhd box (defaults) has no values in it. The trun box has SampleSize, SampleFlags and SampleDuration present, and shows 6 or 7 samples. What I did was sum the SampleDuration of the samples to get the fragment duration. Did I miss anything?
By Alon on Jun 16, 2010
Thanks for this excellent explanation.
Looking at the client manifest file example in the link above I noticed that the video fragments and audio fragments are not aligned (different durations).
Does this imply each fragment contains just video or just audio or is that simply due to some “non uniform interleaving” of video and audio data within mdat of each fragment?
Can you clarify?
Thanks,
Alon
By Alex Zambelli on Jun 25, 2010
@Alon: The physical fragment on disk can contain both audio and video (as is the case with content encoded with Expression Encoder for example), but the HTTP client request must be made separately for each so the fragment that gets sent over the wire only contains either audio or video.
The video and audio fragments are allowed to have different durations (especially since audio fragments tend to be smaller than video fragments) because the A/V samples don’t need to be multiplexed before they’re handed off to the decoder.