How Video Compression Works

Compressed video may be compared to frozen concentrated orange juice. By removing water from freshly squeezed orange juice at the packing house, canned frozen juice can travel great distances at a fraction of the weight and volume of the original liquid. When the consumer mixes the water back into the concentrated juice, they are returning the juice to its original weight and volume. Compression removes the bulk of the data from a video file, then returns that bulk to the file when it is decompressed. Of course, the decompressed video is not exactly like the original file, although good compression will minimize the apparent difference. You prefer freshly squeezed orange juice? Tough.

The need to reduce the bulk of an uncompressed video stream is obvious when you calculate its data rate, the rate at which its bits and bytes stream over a communication channel, such as when a video file is moved from storage to the screen. Let's take the example of digital video created for the screen resolution of garden-variety video cell phones: QCIF (176x144 pixels).

Pixels: 176x144 = 25,344

Frame Rate: 30 frames/sec

Color Depth: 20 bits/pixel

Framing Overhead: 1.3

Calculation: 25,344 x 30 x 20 x 1.3 = 20 Mb/sec
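The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function name uncompressed_bit_rate is an illustrative choice, and the inputs are simply the figures from the example.

```python
def uncompressed_bit_rate(width, height, fps, bits_per_pixel, overhead=1.0):
    """Return the raw video data rate in bits per second."""
    return width * height * fps * bits_per_pixel * overhead

# QCIF figures from the example: 176x144, 30 fps, 20 bits/pixel, 1.3 overhead
qcif_rate = uncompressed_bit_rate(176, 144, 30, 20, overhead=1.3)
print(f"{qcif_rate / 1_000_000:.1f} Mb/sec")  # 19.8 Mb/sec, roughly 20
```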

The cell phone would have to move 20 megabits of data from storage to the screen every second that the video is running. This is way beyond the data handling capacity of cell phones. In fact, it is generally accepted that "high quality" video on mobile devices means QVGA resolution:

- a video bit rate of 384 kbs,

- 15 fps (frames per second),

- a frame size of 320 by 240 pixels,

- a 24-kHz sampling rate,

- and stereo sound with an audio bit rate of 128 kbs.

On cell phones the video rate has to drop to 64 kbs, the screen size has to be reduced to 176x144 pixels and the audio rate to 64 kbs for video to be shown with acceptable quality.

The other factor in video playback is the computing muscle of the mobile device. Low end mobile phones may not have enough processing power to achieve higher video frame rates. At the lower speeds (15-20 kbs) of 2nd generation networks, a frame rate of 5 or 6 fps should be targeted. Even at higher bit rates (50 kbs) most users will be able only to play back at 7 or 8 fps.

But the situation is changing rapidly. The first generation analog cellular networks had maximum bit rates (the rate at which computer bits flow over a communications channel) of less than 10 kbs, and under 5 kbs under normal network loads. Although the latest generation (3G networks) is theoretically capable of data rates of 384 kbs, a number of factors combine to reduce the effective rate, including the speed at which you are traveling, interference, network load and the strength of the signal. The average user sharing a cell tower with others should expect about 100 kbs. This makes 3G barely sufficient for streaming video. Again, the cautionary note is that technology improvements will constantly edge these figures up.



A glance at the graph above shows why video over cellular networks has become possible. Compression algorithms are crunching videos down to smaller and smaller bit rates while networks are moving up in speed. The sweet spot is an MPEG-4 video (.mp4 or .3gp) playing over a 3G network.

Video compression technology has made it possible to reduce the video data rate on standard television to 4 Mbps and HD (High Definition) to 16-20 Mbps. Using MPEG-4 compression technologies, these data rates can be halved to 2 Mbps and 8-10 Mbps respectively. (See our article on MPEG-4 compression.) When television is delivered over cable or DSL, more bandwidth is required than the minimum for the video stream. That's because the DSL and cable bandwidth is simultaneously shared with VOIP, data and other services. If two people sharing the same Internet connection are watching two different video streams, then each video stream will require its own share of the bandwidth, doubling the total data rate required for video.

Generally speaking, the pocket video producer must be as savvy about compressing video for distribution as the web video producer. However, the need for compression is much greater in the mobile space when you consider the common denominator is not a 3G user, but rather a user who is the equivalent of the dial-up Internet user.

So in the rest of this article, we'll help you understand video compression better so that you can create smarter digital videos that compress to the smallest size and bit rate possible.

The Simplest Form of Compression
The simplest form of compression is to eliminate every second frame of a video and duplicate the preceding frame. This cuts the number of frames in the video in half, then pads out the video again to the original number of frames to preserve timing.

In this original video, the video is running in a cycle of 15 frames:


15 unique frames

Here is the same animation, with every second frame deleted and the previous frame duplicated:


Frame Doubled

Notice how smooth the first animation is compared to the second animation. If you look carefully you will see the cycle is the same, but the second horse seems to be going slower. However, it is the file size that is the biggest difference. The first image is 72 kb, the second 35 kb. Removing every second frame and frame doubling has reduced the file size by about half.

What if we had simply removed every second frame and not doubled the previous frame?


8 frames

The horse appears to be running faster because it is going through the run cycle in half the time of the original animation. And the file size is the same as the frame-doubled (second) example.
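The two variations above, dropping every second frame and frame doubling, can be sketched in a few lines of Python. The frames here are just numbers standing in for images, and the function names are illustrative, not taken from any particular tool.

```python
def drop_every_second(frames):
    """Keep frames 0, 2, 4, ... and discard the rest."""
    return frames[::2]

def frame_double(frames):
    """Replace every second frame with a copy of the frame before it."""
    doubled = []
    for i, frame in enumerate(frames):
        doubled.append(frame if i % 2 == 0 else frames[i - 1])
    return doubled

original = list(range(1, 16))  # stand-ins for 15 unique frames

print(drop_every_second(original))
# [1, 3, 5, 7, 9, 11, 13, 15]  (8 frames, cycle plays in half the time)

print(frame_double(original))
# [1, 1, 3, 3, 5, 5, 7, 7, 9, 9, 11, 11, 13, 13, 15]  (15 frames, 8 unique)
```

Both versions contain the same 8 unique images, which is consistent with the two files ending up the same size.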

All these examples use gif animation which uses a simple form of compression.

Color Compression
Another form of compression is color compression.



256 Color Palette

The gif format used by the horse animations in the previous section uses a kind of color compression. Each pixel in a gif image can take one of between 2 and 256 possible colors. A black and white gif image, for example, only needs one computer bit to store the color of each pixel: black or white. Because each pixel only requires one bit to store the value of the color, the image file size is small compared to 24 bit full color images.

The running horse animation only uses 256 colors, mostly shades of green and grey. An image with 256 colors uses 8 computer bits per pixel. Adobe calls this format "indexed color" because the color values are stored in a "look-up table" or palette with the file.

The image on the left shows the palette for the horse animation in the previous section as represented on screen by Photoshop.

Now let's see what happens when we color reduce the image from 256 colors to 16 colors. (Sixteen colors can be represented by 4 bits, because 2 x 2 x 2 x 2 = 16.)

Here is what the new image looks like:


16 color palette

Compare that to the original:

256 Colors

Except for a slight darkening of the 16 color (4 bit) version, the images are almost identical. However, the 16 color image is 1.88 kb in size, whereas the 256 color image is 5.18 kb, so the color reduced image is almost three times smaller. That's because only 4 bits of color information are stored for each pixel of the color reduced image, versus the 8 bits for the 256 color image.
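The relationship between palette size and bits per pixel can be sketched as follows. The helper names and the 100x100 image size are assumptions for illustration only. The result covers just the raw pixel indexes; real gif files also apply LZW compression and store the palette, so actual file sizes will differ.

```python
def bits_per_pixel(palette_size):
    """Smallest number of bits that can index every palette entry."""
    return max(1, (palette_size - 1).bit_length())

def raw_index_bytes(width, height, palette_size):
    """Size of the pixel indexes alone, before any further compression."""
    return width * height * bits_per_pixel(palette_size) / 8

for colors in (2, 16, 256):
    print(colors, "colors ->", bits_per_pixel(colors), "bits per pixel")

# Hypothetical 100x100 image: halving the bits halves the raw index data
print(raw_index_bytes(100, 100, 256))  # 10000.0
print(raw_index_bytes(100, 100, 16))   # 5000.0
```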

What is noticeable about the color reduction is that our eyes can't see a big difference between the two images. So good color compression depends on taking advantage of this weakness in human perception.

The gif image format is used for still computer graphics. During compression, video uses a different color space than computer graphics, but the basic principle is the same. The eye is very sensitive to changes in luminance (brightness) in an image, but not to changes in chroma (color minus the luminance), so video color compression concentrates on reducing the color information, not the luminance.

Normal TV assigns 8 bits, or 256 possible levels, to each of the three colors displayed on the screen: red, green and blue (RGB). So the total number of bits per pixel is 24 (3 x 8). At cell phone resolution (176x144 pixels), that's 608,256 bits of information, about 0.6 megabits for each frame of video, or over 18 megabits per second at 30 fps. By reducing the color depth to 16 bits per pixel, and reducing the frame rate to 15 fps, you are able to reduce the uncompressed video from about 18 Mb per second to about 6 Mb per second.

But we are still a long way from 100 kbs. For perspective, the data rate of full-motion video at 640x480 pixel resolution, 24 bits per pixel, is 221 Mb/sec. At 100 kbs, it would take over a day and a half to transmit one minute of uncompressed standard television. We have a long way to go to compress full screen video down to a transmission speed suitable for 3G systems.
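The "day and a half" figure can be verified with a quick calculation. This sketch assumes 30 frames per second for full-motion video and treats 100 kbs as 100,000 bits per second; the function name is illustrative.

```python
def transmit_seconds(width, height, bits_per_pixel, fps, clip_seconds, channel_bps):
    """Time needed to send an uncompressed clip over a channel."""
    total_bits = width * height * bits_per_pixel * fps * clip_seconds
    return total_bits / channel_bps

# One minute of 640x480, 24-bit video at an assumed 30 fps over a 100 kbs link
seconds = transmit_seconds(640, 480, 24, 30, 60, 100_000)
print(f"{seconds / 3600:.1f} hours")  # 36.9 hours, just over a day and a half
```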

Using Math to Encode Images
The previous simple forms of compression demonstrate ways that redundant information can be removed from still and moving images, like water from orange juice. But halving the frame rate and reducing the number of bits assigned to each pixel for color definition is simply not enough.

The next compression technique involves using a mathematical formula to represent redundant information in an image. Instead of storing and transmitting the redundant information, you store and transmit the mathematical formula that describes that redundant information in a more compact form. A good analogy is a recipe. A recipe for a cake stores the instructions for creating the cake. Instead of storing or shipping a cake, you store or ship the recipe.

In its simplest form, it's called RLE or Run Length Encoding. Let's take the example of this block of pixels, which may be an edge in the running horse image:


Sampled Edge
If the image is being read from a file into a display, pixel by pixel, line by line starting from the top left corner, it might read like this:

green, green, green, green, green, green, green, white, white, white, white, white

That is a lot of redundant information. The pixel colors can be encoded in a much more abbreviated form. This would be the recipe for storing the color information:

(7 x green) + (5 x white)

The recipe for recreating the image is much smaller than the uncompressed version.
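A minimal run length encoder and decoder for the row of pixels above might look like this. It is a sketch of the general idea, not the exact scheme used by any particular file format, and the function names are illustrative.

```python
from itertools import groupby

def rle_encode(pixels):
    """Collapse each run of identical pixels into a (count, color) pair."""
    return [(len(list(run)), color) for color, run in groupby(pixels)]

def rle_decode(runs):
    """Expand (count, color) pairs back into the original pixel list."""
    return [color for count, color in runs for _ in range(count)]

row = ["green"] * 7 + ["white"] * 5
encoded = rle_encode(row)
print(encoded)  # [(7, 'green'), (5, 'white')]

# Decoding the recipe gives back exactly the pixels we started with
assert rle_decode(encoded) == row
```

Twelve stored values become two pairs. A row where every pixel differs from its neighbor would gain nothing, which is why this method suits flat computer graphics better than photographic images.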

RLE encoding works well for computer graphics where there are large areas of identical color values, like our running horse animation. But a video of a real horse with its natural coloration would not compress well using run length encoding. So mathematicians have developed very complex solutions for handling images with continuous tone areas of color.

DCTs
The idea behind Discrete Cosine Transforms (DCTs) is actually an old one, going back about two centuries. Joseph Fourier developed a theory that any series of numbers can be reproduced by adding together enough simple waves of different frequencies. So this is basically the "recipe" idea again. Store the recipe, not the cake.

The transform works on the intensity values of an image. It does this by dividing the image into 16x16 (or 8 x 8) pixel sub-images and transforming the pixel intensity values into frequency values. In the case of a continuous tone image like a picture of a horse, there is a very gradual change in intensity over an area of 16x16 pixels, usually with no sudden spikes of intensity. So a relatively compact mathematical description of that gradual change can be created and stored.
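To see why a gradual change in intensity leads to a compact description, here is a small sketch that applies the textbook DCT (type II) to an 8x8 block. The block values are made up for illustration (a gentle gradient), and this is a plain reference calculation, not the optimized transform a real codec would use.

```python
import math

def dct_1d(values):
    """Orthonormal DCT-II of a list of samples."""
    n = len(values)
    out = []
    for k in range(n):
        total = sum(v * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
                    for i, v in enumerate(values))
        scale = math.sqrt(1 / n) if k == 0 else math.sqrt(2 / n)
        out.append(scale * total)
    return out

def dct_2d(block):
    """Apply the 1D transform to every row, then to every column."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d(col) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]

# Hypothetical 8x8 block with a smooth gradient in intensity
block = [[100 + 2 * x + y for x in range(8)] for y in range(8)]
coeffs = dct_2d(block)

# Count the frequency values large enough to matter
significant = sum(1 for row in coeffs for c in row if abs(c) > 1)
print(significant, "of 64 coefficients are larger than 1")
```

For this smooth block only a handful of the 64 frequency values are significant, and they sit in the low frequencies. The rest are close to zero and cost almost nothing to store.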

Images or video frames whose width and height are divisible by 16 compress more efficiently because of this method of dividing up the image into blocks.

Obviously this approach does not work so well with images that have sharp edges (with corresponding sharp changes in pixel intensities) like text, blades of grass or images with contrasty lighting. When DCT compression has difficulty in dealing with edges, you see ringing, where artifacts (erroneous pixels) appear around edges, and blocking, where the edges of the transformed blocks become visible.

The reason why this approach works so well on continuous tone images is that the eye is not sensitive to small changes in high frequencies and most images have little amplitude in the higher frequencies. So we can throw away detail (bits) in high frequencies without the eye noticing. The number of bits assigned to a pixel can be reduced from 8 to 2. This is the basis for JPEG compression.

Temporal Compression
So far the types of compression we have presented have applied to individual frames. Video is a series of images presented to the eye over time. The fact is, one frame is a lot like the next.


Frame One

Frame Two

Besides the white background and grey ground being common to both frames, most of the rider's and horse's bodies are common to both images. So we only need to encode the difference between the first and second frame.
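Encoding only the difference between two frames can be sketched like this. The frames are tiny made-up rows of pixels, and the function names are illustrative. Real codecs store numeric differences for blocks of pixels rather than a list of changed positions.

```python
def frame_delta(previous, current):
    """List (index, new_value) for every pixel that changed."""
    return [(i, new)
            for i, (old, new) in enumerate(zip(previous, current))
            if old != new]

def apply_delta(previous, delta):
    """Rebuild the current frame from the previous frame plus the changes."""
    frame = list(previous)
    for i, value in delta:
        frame[i] = value
    return frame

# Hypothetical 12-pixel frames: the brown "horse" moves one pixel to the right
frame_one = ["white"] * 6 + ["brown"] * 2 + ["grey"] * 4
frame_two = ["white"] * 7 + ["brown"] * 2 + ["grey"] * 3

delta = frame_delta(frame_one, frame_two)
print(delta)  # [(6, 'white'), (8, 'brown')]

assert apply_delta(frame_one, delta) == frame_two
```

Only 2 of the 12 pixels need to be stored for the second frame.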

However even a small camera jiggle can throw this form of compression off, so techniques for motion estimation and compensation were developed. These techniques try to determine if part of the previous frame has moved to a new position in the present frame.
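One simple way to estimate motion is block matching: take a block from the present frame and search nearby positions in the previous frame for the closest match. The sketch below uses an exhaustive search and the sum of absolute differences as the measure of closeness. This is one common approach, chosen here for illustration; the frames, the search range and the function names are all assumptions, and real encoders use faster search strategies.

```python
def sad(frame, block, top, left):
    """Sum of absolute differences between block and the frame region at (top, left)."""
    return sum(abs(frame[top + y][left + x] - block[y][x])
               for y in range(len(block)) for x in range(len(block[0])))

def find_motion(previous, block, top, left, search=2):
    """Return the offset (dy, dx) into the previous frame that best matches block."""
    h, w = len(block), len(block[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            t, l = top + dy, left + dx
            # Skip candidate positions that fall outside the previous frame
            if 0 <= t <= len(previous) - h and 0 <= l <= len(previous[0]) - w:
                cost = sad(previous, block, t, l)
                if best is None or cost < best[0]:
                    best = (cost, dy, dx)
    return best[1], best[2]

# Hypothetical 6x6 frames: a 2x2 patch moves from (1, 1) to (2, 3)
patch = [[9, 8], [7, 6]]
previous = [[0] * 6 for _ in range(6)]
current = [[0] * 6 for _ in range(6)]
for y in range(2):
    for x in range(2):
        previous[1 + y][1 + x] = patch[y][x]
        current[2 + y][3 + x] = patch[y][x]

block = [row[3:5] for row in current[2:4]]
motion = find_motion(previous, block, top=2, left=3)
print(motion)  # (-1, -2): the block came from one row up and two columns left
```

Instead of re-encoding the patch, the encoder can store the offset and reuse the pixels it already has from the previous frame.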

The first frame in a temporal compression sequence is called the keyframe, or the I-frame in MPEG. Keyframes are usually automatically inserted at the point in a video sequence where the difference between one frame and the next is large enough to warrant a new keyframe. Some compression programs allow you to designate where keyframes should occur. A good candidate is at a video cut, since an entirely new frame with no similarities to the previous frame will occur at that point. Editing formats like DV make every frame a keyframe so that you don't have to wait for the editing program to find a keyframe when it is jumping forward or backward in time. If you jumped to a frame that only held the differences from the previous frame, you would have to work back to a keyframe and then work forward again, picking up the differences that accumulate at the editing insertion point.
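The cost of jumping to a frame that is not a keyframe can be sketched as follows. The stream layout, a list of ("key", pixels) and ("delta", changes) entries, is a made-up structure for illustration and does not correspond to any real file format. The sketch assumes the first entry is always a keyframe.

```python
def decode_frame(stream, target):
    """Rebuild frame number `target`; also report how many deltas were replayed."""
    # Work back to the nearest keyframe (assumes stream[0] is a keyframe)
    start = target
    while stream[start][0] != "key":
        start -= 1
    frame = list(stream[start][1])
    # Work forward again, applying each set of differences in order
    for _kind, changes in stream[start + 1 : target + 1]:
        for i, value in changes:
            frame[i] = value
    return frame, target - start

stream = [
    ("key",   [0, 0, 0, 0]),
    ("delta", [(1, 5)]),
    ("delta", [(2, 7)]),
    ("key",   [9, 9, 9, 9]),
    ("delta", [(0, 1)]),
]

print(decode_frame(stream, 2))  # ([0, 5, 7, 0], 2)
print(decode_frame(stream, 3))  # ([9, 9, 9, 9], 0)
```

Landing on a keyframe needs no extra work, while landing between keyframes means replaying every set of differences since the last one. That is the wait that all-keyframe editing formats avoid.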

The frames that occur between keyframes are called delta frames or P-frames in MPEG terms. If a delta frame is dropped during playback, playback has to stop until the next keyframe is reached.

There is another type of intermediate frame called a bi-directional frame, or B-frame in MPEG terms. This frame is based on both the previous and the subsequent frame. B-frames compress better than delta frames, and have both advantages and disadvantages. The biggest advantage is that they can be dropped during playback without stopping playback. The biggest disadvantage is that a subsequent delta frame cannot be based on them, so they are of little value in motion prediction and compensation.

Conclusion
We have provided an overview of different compression techniques, from simple frame doubling to complex temporal compression using motion prediction and compensation. Contemporary codecs (compression / decompression algorithms) employ all these techniques, achieving very impressive results. It is possible to compress a cell phone video (176x144) running at 15 fps to 100 kbs. However, your success in achieving this rate depends heavily on the content of the video. A handheld panning shot of a horse at a race track would be hard to compress to this bit rate. The same horse shot on a tripod in a field grazing would compress a lot better. And if the background is out of focus, it would compress even better.

Knowing how compression works is critical to the success of the pocket video producer.