Compressing Audio Data for Mobile

Most of the principles governing compression of video data for playback on constrained devices like the mobile phone apply also to the audio track. This article focuses on audio data compression. This is not an article about audio level compression, a method for altering the volume levels of audio.

As with video compression, there are two forms of audio compression: lossless and lossy. In the first case, audio is compressed without loss of information. An example is FLAC (Free Lossless Audio Codec). In the second case, information is lost. Examples are Vorbis, MP3 and AAC. We'll discuss codecs later.

Algorithms implementing audio compression use both lossy and lossless techniques. How much compression is applied to the audio portion of a mobile video depends on how it will be used. Today's large hard drives and fast communication systems encourage storage of data in uncompressed format. However, the need for compression is easy to illustrate.

The Bit Rate

When audio is digitized, it is converted into discrete samples. There are two dimensions: the first is the number of times per second the audio is sampled (the sample rate), and the second is the number of bits used to record each sample (the resolution).
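As a rough illustration of these two dimensions, the sketch below samples a sine tone and rounds each sample to a signed integer of a given resolution. The function name digitize_sine and its parameters are made up for this example; a real recording chain does this work in hardware.

```python
import math

def digitize_sine(freq_hz, sample_rate, bits, duration_s):
    """Sample a full-scale sine tone and quantize it to signed integers."""
    max_level = 2 ** (bits - 1) - 1            # 32767 for 16 bits
    n_samples = round(sample_rate * duration_s)
    samples = []
    for n in range(n_samples):
        t = n / sample_rate                    # time of this sample in seconds
        value = math.sin(2 * math.pi * freq_hz * t)
        samples.append(round(value * max_level))
    return samples

# One hundredth of a second of a 1 kHz tone at CD settings
cd_samples = digitize_sine(1000, 44100, 16, 0.01)
print(len(cd_samples), min(cd_samples), max(cd_samples))
```

A higher sample rate produces more samples per second, and a higher resolution allows each sample a wider range of integer values.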

The amount of data generated by digitization each second is called the bit rate, expressed in "bits per second" (bps) or "kilobits per second" (Kbps). It is calculated in the following way:

sample rate x resolution x the number of audio channels

A CD-quality stereo file that is digitized without compression has a sample rate of 44,100 Hz and a resolution of 16 bits. The amount of data produced for every second of audio is as follows:

44100 x 16 x 2 = 1,411,200 bits per second (bps)

or

1.4 megabits per second (Mbps)

Computers store data in 8-bit bytes, so we can find out how much storage space this data rate requires by dividing by 8:

1,411,200 bits / 8 = 176,400 bytes per second

Uncompressed CD-quality sound has to be saved to the hard drive at a rate of about 176 kilobytes (kB) for each second of audio. That means a three-and-a-half-minute song (210 seconds) is stored uncompressed as about 37 megabytes.
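The arithmetic in this section can be checked with a few lines of Python. This is only a sketch: the function names are invented for the example, and a megabyte is assumed to be 1,000,000 bytes.

```python
def bit_rate_bps(sample_rate, resolution_bits, channels):
    """Bit rate = sample rate x resolution x number of channels."""
    return sample_rate * resolution_bits * channels

def storage_bytes(bps, duration_s):
    """Bytes needed to store duration_s seconds at the given bit rate."""
    return bps / 8 * duration_s

cd_bps = bit_rate_bps(44100, 16, 2)
print(cd_bps)        # 1411200 bits per second
print(cd_bps // 8)   # 176400 bytes per second

# A three-and-a-half-minute song, in megabytes (assuming 1 MB = 1,000,000 bytes)
song_mb = storage_bytes(cd_bps, 3.5 * 60) / 1_000_000
print(round(song_mb, 1))
```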

A one-minute viral video 1 megabyte in size cannot carry its audio uncompressed: one minute of CD-quality audio alone takes about 10.6 megabytes. Compression must be used to remove data from the audio portion of the video.

The more constrained the environment where the audio is played back, the more data must be removed from the raw audio file. A mobile phone with QCIF (176x144) video resolution will play back video locally at 10 fps (frames per second) at a bit rate of 80 Kbps (kilobits per second). The bit rate for streaming mobile video is even more constrained. Here is the comparison:

Local Playback: 10 fps, 80 Kbps
Streaming: 7.5 fps, 20 Kbps

Now you know why streaming mobile video looks so bad. Here are some other playback numbers for comparison:

Audio-only playback is 32 Kbps, usually described as FM radio quality.
AM quality is 5 Kbps, usually used by voice-only Internet broadcasts.
Compressed audio described as CD quality ranges between 64 and 128 Kbps.
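To get a feel for how much data has to go, the sketch below compares these target bit rates with the 1,411,200 bps of uncompressed CD audio. It assumes 1 Kbps means 1,000 bits per second, and the function name reduction_factor is invented for this example. Note that the 20 Kbps streaming figure is the budget for the whole video stream, not just its audio.

```python
# Uncompressed CD audio, from the bit rate calculation earlier in the article
UNCOMPRESSED_CD_BPS = 44100 * 16 * 2

def reduction_factor(target_kbps):
    """How many times smaller the target stream is than raw CD audio."""
    return UNCOMPRESSED_CD_BPS / (target_kbps * 1000)

targets = [
    ("CD quality (upper)", 128),
    ("CD quality (lower)", 64),
    ("FM radio quality", 32),
    ("Streaming video (total)", 20),
    ("AM quality", 5),
]

for label, kbps in targets:
    print(f"{label:24s} {kbps:4d} Kbps  about {reduction_factor(kbps):.0f}:1")
```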

Let's look at some ways to reduce the audio data overhead.

Psychoacoustics

Compression algorithms for video are biased towards removing visual detail that the eye is not sensitive to. Audio compression algorithms work the same way, preserving the frequencies that hearing is most sensitive to and discarding those we cannot hear or hear poorly. Although a young person can hear frequencies between 20 Hz and 20 kHz, the ear is most sensitive around 4 kHz, in the audio mid range. So when the compression algorithm is forced to throw away a great deal of data, it is likely to discard the lowest and highest frequencies first.

Sounds that are masked by other, louder sounds can also be thrown away without too much loss of quality.
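The idea of masking can be shown with a deliberately crude toy model. The rule used here (a tone is treated as masked when a neighbour within 200 Hz is at least 20 dB louder) and the function name drop_masked are assumptions chosen for illustration only; real codecs use far more detailed, frequency-dependent models of hearing.

```python
def drop_masked(components, max_gap_hz=200.0, min_margin_db=20.0):
    """Return only the components not masked by a louder neighbour.

    components is a list of (frequency_hz, level_db) pairs.
    The thresholds are arbitrary values picked for this toy example.
    """
    kept = []
    for freq, level in components:
        masked = any(
            abs(other_freq - freq) <= max_gap_hz
            and other_level - level >= min_margin_db
            for other_freq, other_level in components
        )
        if not masked:
            kept.append((freq, level))
    return kept

tones = [(1000.0, 80.0), (1100.0, 50.0), (4000.0, 55.0)]
# The quiet 1100 Hz tone sits next to the loud 1000 Hz tone and is dropped;
# the 4000 Hz tone is just as quiet but has no loud neighbour, so it stays.
print(drop_masked(tones))
```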

Other Algorithms

In our article "How Video Compression Works" we describe many of the methods that modern codecs use to remove redundant information from an audio stream. The less random the data, the more redundancy can be removed from the file. Fast-paced, active music recorded from the real world will not have as much redundancy as computer-generated music or simple, quiet and repetitive songs.
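The link between randomness and redundancy can be demonstrated with Python's standard zlib module. zlib is a general-purpose lossless compressor, not an audio codec, so this is only an analogy: it shows that a repetitive signal shrinks dramatically while random noise hardly shrinks at all. The helper names are invented for the example.

```python
import math
import random
import struct
import zlib

SAMPLE_RATE = 44100

def to_pcm16(samples):
    """Pack floats in [-1, 1] as little-endian 16-bit PCM bytes."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)

def compressed_fraction(pcm_bytes):
    """Compressed size divided by original size (smaller is better)."""
    return len(zlib.compress(pcm_bytes, 9)) / len(pcm_bytes)

# One second of a 441 Hz tone, which repeats every 100 samples at 44,100 Hz
tone = [math.sin(2 * math.pi * 441 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# One second of random noise (fixed seed so the run is repeatable)
rng = random.Random(0)
noise = [rng.uniform(-1.0, 1.0) for _ in range(SAMPLE_RATE)]

print(f"tone:  {compressed_fraction(to_pcm16(tone)):.2f} of original size")
print(f"noise: {compressed_fraction(to_pcm16(noise)):.2f} of original size")
```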

Optimizing Audio for Mobile Video Compression

To begin with, you can help video compression do its job by giving it as clean an audio track as possible. This means creating the file at a minimum of 16 bits and 44,100 Hz, and maintaining that quality right through to compression. It is the same rule that you follow in creating and compressing graphics or still photos.

Think of this as the master. Even though you might be encoding for playback on a mobile device, there is a chance your video may end up on the Internet or television.

Removing defects from the soundtrack is important. For example, remove background noise. If you used a consumer sound card to record voice, you may have a lot of noise in the recording, some of which you may not be able to hear. See our tutorial on removing background noise.

Another defect comes from consumer sound cards recording a constant voltage as part of the waveform, called DC Offset. This distorts the sound, introducing clicks and pops. It can be removed through the DC filtering function in audio editing software. See our tutorial that includes advice on this subject.
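In its simplest form, removing a DC offset means subtracting the average value of the waveform from every sample so that the signal is centred on zero again. The sketch below assumes the samples are already available as a list of floats; the function name remove_dc_offset is invented, and audio editors may implement DC filtering differently, for example with a high-pass filter.

```python
def remove_dc_offset(samples):
    """Centre a waveform on zero by subtracting its mean value."""
    if not samples:
        return []
    offset = sum(samples) / len(samples)   # the constant (DC) component
    return [s - offset for s in samples]

# A waveform that swings between -0.75 and 1.25 instead of -1 and 1
shifted = [0.25, 1.25, 0.25, -0.75]
print(remove_dc_offset(shifted))
```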

Using compressors or limiters to reduce the fluctuations in the amplitude of an audio signal can make it easier to compress. See our article on compressing a voice recording.
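What a compressor does to the signal can be sketched as follows. This is a bare static curve applied sample by sample, with an assumed threshold of 0.5 and a 4:1 ratio; real compressors and limiters track the signal envelope and add attack and release times, none of which is modelled here.

```python
def compress_dynamics(samples, threshold=0.5, ratio=4.0):
    """Reduce the part of each sample that exceeds the threshold.

    Samples are floats in [-1, 1]. Anything above the threshold is
    scaled down by the ratio, so loud peaks end up closer to quiet parts.
    """
    out = []
    for s in samples:
        level = abs(s)
        if level > threshold:
            level = threshold + (level - threshold) / ratio
        out.append(level if s >= 0 else -level)
    return out

# Quiet samples pass through unchanged; full-scale peaks are pulled in
print(compress_dynamics([0.25, 1.0, -1.0]))
```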

You can help the compression software do its job by removing sounds and instruments that are either extraneous to your soundtrack or do not contribute significantly to its impact. The more complex the sound mix, the more difficult it is to compress it to low bit rates without mangling the sound.

For short one-minute viral videos, an effective strategy, where appropriate, is to use music only at the beginning and end of the video, with voice and sound effects in between.

You can also remove information from individual tracks. For example, in the voice track of a multitrack recording, cut frequencies below 100 Hz and boost frequencies in the 1 to 4 kHz range. Most people will not hear the difference, especially when the soundtrack is played back on a mobile device.
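The low cut described here is normally done with the equalizer in audio editing software. As a rough idea of what happens underneath, the sketch below implements a simple first-order high-pass filter. It is a gentle filter (about 6 dB per octave) chosen for brevity, not a recommendation, and it covers only the cut below 100 Hz, not the boost.

```python
import math

def high_pass(samples, cutoff_hz, sample_rate):
    """First-order (RC-style) high-pass filter over a list of floats."""
    if not samples:
        return []
    rc = 1.0 / (2 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Each output follows changes in the input but decays toward zero,
        # so slow (low-frequency) movement is attenuated.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out

SAMPLE_RATE = 44100
rumble = [math.sin(2 * math.pi * 50 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
voice = [math.sin(2 * math.pi * 1000 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]

# Peak level after filtering: the 50 Hz rumble is reduced, the 1 kHz tone is not
print(max(abs(x) for x in high_pass(rumble, 100, SAMPLE_RATE)[SAMPLE_RATE // 2:]))
print(max(abs(x) for x in high_pass(voice, 100, SAMPLE_RATE)[SAMPLE_RATE // 2:]))
```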

Audio Codecs

Audio codecs vary according to the purpose for which they were designed. Some are designed to compress voice without music, others are designed for music. The best that a music-oriented CD-quality algorithm can achieve is to create a file 25% of the original size.

The most common lossy codec is MP3 (MPEG-1 Audio Layer 3), widely used to store audio on portable audio players. It is defined in both the first and second generation MPEG standards (MPEG-1 and MPEG-2). MP3 encodes pulse-code modulation (PCM) audio data to a smaller size by discarding audio that is difficult for people to hear.


The Ogg Vorbis codec is an open source codec developed when Fraunhofer-Gesellschaft decided to start charging licensing fees for MP3. It is not supported as well in the market as MP3.

AAC (Advanced Audio Coding) is also known as MPEG-2 Part 7 and, in a slightly different form, MPEG-4 Part 3. AAC is regarded as the successor to MP3 because it improves on the compression ratio (1:16 versus 1:10) and has better audio quality at a lower bit rate. Among the algorithms it uses is the modified discrete cosine transform (MDCT), which compresses certain types of complex audio waveforms more efficiently. MPEG-4 AAC codecs offer good audio quality at bit rates down to 24 Kbps.

AAC includes DRM (Digital Rights Management), unlike MP3, and supports multiple profiles for application to different environments, from constrained mobile devices to home theatre systems.

AAC has been widely supported by mobile phone manufacturers, as well as DVB, XM satellite, iTunes, the iPod and PlayStation Portable. It is found in both MPEG-4 and 3GPP / 3GPP2 mobile video formats.

AACPlus (also known as AAC SBR, for Spectral Band Replication) is an improved version of AAC. It uses an algorithm to improve reproduction of high-end frequencies. It is also more efficient at compressing stereo audio.

See these tables for a comparison of audio codecs, including operating system support, technical details and bit rates.

VBR and CBR

Video and audio programs vary widely in their complexity over time. In the case of audio, a passage with lots of instruments and vocals has a lot of detail and will require more compression than a passage where there is a solo instrument playing. Codecs offer two basic choices for encoding audio: CBR and VBR.

CBR stands for "Constant Bit Rate" encoding. It is the best (read: safest) choice for streaming audio at any bandwidth, but especially over limited-rate channels. It allows the audio stream to take advantage of all the capacity available. However, fast or detailed passages have to be more heavily compressed than quiet or slow passages. This results in an inconsistent level of quality throughout the soundtrack, with fast or complex passages showing obvious compression artifacts.

VBR stands for "Variable Bit Rate" encoding. As the name implies, the encoder varies the amount of data output per unit of time. This allows the encoder to assign a higher bit rate to complex passages and a lower bit rate to simple sections. The resulting file is larger, by about 5% for audio files, but the overall quality of the encoded file, like its average bit rate, is higher.

CBR is largely used for low capacity streaming applications, whereas VBR is used for high capacity (fast DSL) applications or where the audio is downloaded and then played.
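The difference between the two modes can be sketched as a question of how a bit budget is shared among frames. In the toy model below, each frame has a made-up "complexity" score; CBR gives every frame the same number of bits, while VBR shares the budget in proportion to complexity. To isolate the effect of distribution, the sketch assumes both modes spend the same total, whereas a real VBR encoder may end up with a somewhat larger file. The function names and scores are illustrative only.

```python
def cbr_allocation(complexities, bits_per_frame):
    """Constant bit rate: every frame gets the same number of bits."""
    return [bits_per_frame for _ in complexities]

def vbr_allocation(complexities, average_bits_per_frame):
    """Variable bit rate: share the same total in proportion to complexity."""
    total_bits = average_bits_per_frame * len(complexities)
    total_complexity = sum(complexities)
    if total_complexity == 0:
        return [average_bits_per_frame for _ in complexities]
    return [total_bits * c / total_complexity for c in complexities]

# Two simple frames (solo instrument), one very busy frame, one moderate frame
complexities = [1.0, 1.0, 4.0, 2.0]

print(cbr_allocation(complexities, 1000))   # same bits everywhere
print(vbr_allocation(complexities, 1000))   # busy frames get more bits
```

Under CBR the busy frame has to describe four times as much detail with the same number of bits as a simple frame, which is where the audible artifacts come from.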

One-Pass and Two-Pass Encoding (Multipass)

In one-pass encoding, the encoder examines sections of audio and performs the necessary compression to reach the targeted bit rate. In two-pass encoding, the encoder first runs through the entire file gathering information about it, then encodes it during a second pass. This makes better use of the available bit rate, and therefore produces superior quality, although VBR-encoded files benefit more from two-pass encoding than CBR-encoded files: CBR encoding does not have any flexibility in the bit rate for each frame, whereas VBR does. The drawback is that two-pass encoding can take up to twice as long to encode an audio stream.
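The two passes can be sketched in miniature. In this toy version the first pass scores every frame with a crude measure (the average change between neighbouring samples) and the second pass shares a fixed bit budget according to those scores. Both the measure and the function names are assumptions for illustration; real encoders analyse the audio with perceptual models.

```python
def frame_complexity(frame):
    """Crude complexity score: mean absolute change between samples."""
    if len(frame) < 2:
        return 0.0
    steps = [abs(b - a) for a, b in zip(frame, frame[1:])]
    return sum(steps) / len(steps)

def two_pass_allocate(frames, total_bits):
    """Return the number of bits assigned to each frame."""
    if not frames:
        return []
    # Pass 1: analyse the whole file before spending any bits
    scores = [frame_complexity(f) for f in frames]
    total_score = sum(scores)
    if total_score == 0:
        return [total_bits / len(frames)] * len(frames)
    # Pass 2: share the budget in proportion to what pass 1 found
    return [total_bits * s / total_score for s in scores]

quiet = [0.0, 0.0, 0.0, 0.0]
medium = [0.0, 0.5, 0.0, 0.5]
busy = [0.0, 1.0, 0.0, 1.0]
print(two_pass_allocate([quiet, medium, busy], 3000))
```

A one-pass encoder has to make each decision without knowing what comes later in the file, which is why it cannot plan the budget this way.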

A complete list of video and audio codecs can be found on Wikipedia: http://en.wikipedia.org/wiki/List_of_codecs






Very informative article! I spent a lot of time trying to decide what format to use on my wife's smartphone and my Pocket PC. We are currently using Ogg for a variety of reasons, but as flash media gets larger I can see the day when FLAC will be our mobile format of choice.