Digital is becoming the dominant technology in today's audio applications. With satellite radio, VoIP, cellular phones, and MP3 players, digital is replacing analog techniques in broadcast entertainment, telephony, and even personal audio recording. Hidden beneath all these applications, a digital signal processor (DSP) is working overtime to translate the complex sounds we hear into a minimal bit stream. The first step in applying digital audio signal processing is to look at the coding processes in use.
Data bandwidth is one of the first places developers look when considering audio codecs. Human hearing spans a frequency range of approximately 20 Hz to 20 kHz, with a useful signal range of 120 dB from inaudible to painful (see Table 1). Taking a brute-force digitization approach, this gives a Nyquist sampling rate of at least 40 kHz, and you need two audio channels for stereo. Compact disc (CD) audio, for instance, uses a 44.1 kHz sample rate with a 16-bit sample (96 dB range), for a stereo audio data rate of 1.4 Mbits/sec. To this raw data rate, systems must often add further overhead for encapsulation, channel coding, and error-correction coding where needed. The fully encoded digital data on CDs produces a bit stream at 4.3 Mbits/sec.
|Sound intensity (W/cm²)||Level (dB SPL)||Example|
|10^-5||110||Jack hammers, rock concerts|
|10^-7||90||OSHA limit for industrial noise|
|10^-12||40||Weakest audible at 100 Hz|
|10^-14||20||Weakest audible at 10 kHz|
|10^-16||0||Weakest audible at 3 kHz|
Table 1: The range of human hearing
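The raw CD figures quoted above can be checked with a few lines of arithmetic. This is only a sketch: the function names are illustrative, and the dynamic-range formula is the usual rule of thumb of about 6 dB per bit for a linear quantizer.

```python
import math

def pcm_bit_rate(sample_rate_hz, bits_per_sample, channels):
    """Raw (uncompressed) PCM data rate in bits per second."""
    return sample_rate_hz * bits_per_sample * channels

def dynamic_range_db(bits_per_sample):
    """Approximate range of a linear quantizer: 20*log10(2**bits)."""
    return 20 * math.log10(2 ** bits_per_sample)

cd_rate = pcm_bit_rate(44_100, 16, 2)
print(cd_rate)                          # 1411200 bits/sec, about 1.4 Mbits/sec
print(round(dynamic_range_db(16), 1))   # 96.3 dB
```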
In many applications, these data rates are much too high and some form of compression is required. Compression comes in two forms: lossless and lossy. Lossless compression exploits the statistical redundancy that occurs in many signals. For example, coding English-language text with a fixed-length code requires at least 5 bits per symbol to account for 26 letters. The use of those letters, however, is not uniform. The letter “E”, for instance, makes up nearly 13% of a long text message, whereas the letter “Z” is less than 1% of the text. By using a variable-length code that assigns fewer bits to “E” and more bits to “Z”, the average number of bits per symbol can be reduced without losing any information.
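Huffman coding is one well-known way to build such a variable-length code. The sketch below computes only the code lengths. The four-symbol alphabet and its probabilities are made up for illustration and are not real English letter statistics.

```python
import heapq

def huffman_code_lengths(freqs):
    """Return {symbol: code length in bits} for a Huffman code."""
    # Heap items: (weight, unique tie-breaker, {symbol: depth so far})
    heap = [(w, i, {s: 0}) for i, (s, w) in enumerate(freqs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        w1, _, a = heapq.heappop(heap)
        w2, _, b = heapq.heappop(heap)
        # Merging two subtrees pushes every symbol in them one level deeper
        merged = {s: d + 1 for s, d in {**a, **b}.items()}
        heapq.heappush(heap, (w1 + w2, counter, merged))
        counter += 1
    return heap[0][2]

# Hypothetical probabilities, chosen only to make the arithmetic easy
freqs = {"E": 0.5, "T": 0.25, "A": 0.125, "Z": 0.125}
lengths = huffman_code_lengths(freqs)
average_bits = sum(freqs[s] * lengths[s] for s in freqs)
print(lengths)        # {'E': 1, 'T': 2, 'A': 3, 'Z': 3}
print(average_bits)   # 1.75, versus 2 bits for a fixed-length code
```

The common symbol gets a 1-bit code and the rare ones get 3 bits, so the average falls below the fixed-length cost while every symbol remains exactly recoverable.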
Unfortunately, lossless compression is not enough to reduce audio data streams to manageable bit rates. There is not enough of a pattern to achieve much more than a factor of two in compression. As a result, audio codecs must turn to lossy compression techniques.
Lossy compression requires that information be selectively eliminated from the signal. A simple example for audio is the use of silence detection. When there is a pause in the audio, you can save bits by replacing the samples of silence with a code that indicates the duration of the silence, then reinserting the missing samples during decoding.
The trouble with lossy compression is that it affects perceived sound quality. In the case of silence detection, for instance, we might notice the lack of background white noise during the reconstructed “silent” periods that we otherwise hear during the rest of the audio signal. Depending on the application, this can be disconcerting.
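A minimal sketch of silence detection follows, assuming integer samples. The threshold, the minimum run length, and the token format are illustrative choices, not part of any standard.

```python
def encode_silence(samples, threshold=2, min_run=4):
    """Replace long runs of near-silent samples with ("silence", length) tokens."""
    out, i = [], 0
    while i < len(samples):
        j = i
        while j < len(samples) and abs(samples[j]) <= threshold:
            j += 1
        if j - i >= min_run:
            out.append(("silence", j - i))   # one token for the whole run
            i = j
        else:
            out.append(("sample", samples[i]))  # run too short: keep the sample
            i += 1
    return out

def decode_silence(tokens):
    """Rebuild the sample stream, reinserting zeros for each silence token."""
    out = []
    for kind, value in tokens:
        if kind == "silence":
            out.extend([0] * value)
        else:
            out.append(value)
    return out

samples = [100, -80, 1, 0, -1, 2, 0, 90]
tokens = encode_silence(samples)
print(tokens)                  # [('sample', 100), ('sample', -80), ('silence', 5), ('sample', 90)]
print(decode_silence(tokens))  # [100, -80, 0, 0, 0, 0, 0, 90]
```

The decoded pause is exact zeros rather than the original low-level samples, which is the missing background noise described above.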
A wide variety of digital audio coding standards have arisen over the last decade because of the range of tradeoffs that developers must consider. Each application has its own optimal combination of data bandwidth, perceived sound quality, processing delay, and other design considerations, and each has evolved its own audio compression standard. Among these varying standards, however, there is some commonality to the audio signal processing approaches being used. They can be grouped in three major forms of audio coder/decoder (codec) approaches: direct audio coding, perceptual audio coding, and synthesis coding.
In direct audio coding, the main tradeoff is between bit rate and audio bandwidth. By filtering the sound before encoding, you can reduce the Nyquist rate needed for proper digitization. You can also limit the dynamic range of the signal. Both limits are standard practice in telephony. Human speech puts most of its sound energy in the 200 Hz to 3.2 kHz frequency band. Further, even yelling has a volume level of only about 70 dB above inaudibility, so 12 bits is sufficient to capture conversations. By filtering the signal and sampling at 8 kHz, telephony systems thus reduce their bit rate to 96 kbits/sec. The resulting sound quality is adequate for a listener to recognize the person speaking by the sound of their voice and to catch the nuances of voice inflection, making it an acceptable level of sound quality for conversations. It is considered low quality for audio such as music, however. AM radio transmissions, for comparison, have a similar audio bandwidth.
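The telephony numbers follow from the same rule of thumb of about 6 dB per bit. A short sketch, with an illustrative helper name:

```python
import math

def bits_for_dynamic_range(range_db):
    """Smallest linear word length whose range 20*log10(2**bits) covers range_db."""
    return math.ceil(range_db / (20 * math.log10(2)))

speech_bits = bits_for_dynamic_range(70)   # about 70 dB for loud speech
telephony_rate = 8_000 * speech_bits       # 8 kHz sampling, one channel
print(speech_bits)      # 12
print(telephony_rate)   # 96000 bits/sec
```

An 8 kHz sample rate supports audio up to 4 kHz, which leaves some margin above the 3.2 kHz upper edge of the speech band for a practical anti-aliasing filter.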
For music and other wideband audio, however, direct coding must take another approach. One of the most common techniques is adaptive differential pulse code modulation (ADPCM), shown in Figure 1. This technique uses an adaptive predictor to estimate the incoming signal, then encodes the difference between the signal and the prediction. If the predictor works well, the differences are small, and small values require fewer bits to encode.
Of course, it is possible for the sound to jump from silence to maximum, and using fewer bits will result in truncation of these large signals. The adaptive nature of the predictor responds to large input deviations by altering the step size in the quantization so that there is a better mapping between the signal and the bit resolution. This approach will result in a variable slew rate on the output and some overshoot when signals change rapidly, such as when high frequencies are present. It is computationally easy, however, and achieves compression of about 4:1. ADPCM is used in many audio applications, such as some types of .wav files and many international telephony standards.
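The idea can be sketched with a deliberately simplified coder. This is a toy, not IMA ADPCM or any ITU-T telephony standard: the predictor is just the previous reconstructed sample, each code is an integer from -7 to 7 (which fits in 4 bits), and the step-adaptation rule is an assumption made for illustration.

```python
def _adapt(step, magnitude):
    """Double the step after large codes, halve it after small ones."""
    if magnitude >= 6:
        step *= 2
    elif magnitude <= 1:
        step //= 2
    return max(1, min(step, 4096))

def adpcm_encode(samples, step=4):
    codes, predicted = [], 0
    for s in samples:
        diff = s - predicted
        magnitude = min(abs(diff) // step, 7)      # clip to a 3-bit magnitude
        code = magnitude if diff >= 0 else -magnitude
        codes.append(code)
        predicted += code * step                   # track what the decoder will see
        step = _adapt(step, magnitude)
    return codes

def adpcm_decode(codes, step=4):
    out, predicted = [], 0
    for code in codes:
        predicted += code * step
        out.append(predicted)
        step = _adapt(step, abs(code))             # same rule as the encoder
    return out

step_input = [1000] * 20                 # jump from silence to a large value
codes = adpcm_encode(step_input)
decoded = adpcm_decode(codes)
print(decoded[:6])    # [28, 84, 196, 420, 868, 996]
print(decoded[-1])    # 1000
```

The output needs several samples to reach the new level, with the step size doubling along the way. That lag is the variable slew rate described above. Because encoder and decoder apply the same adaptation rule to the same codes, the step size never has to be transmitted.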
Perceptual audio coding seeks higher compression levels than direct methods by utilizing knowledge of human hearing. The goal is to eliminate audio information in a way that the listener perceives little or no difference between the decoded audio signal and the original. One of the simplest, and most common, approaches is called companding, and it reflects the ear's response to changes in audio volume.
Human hearing is approximately logarithmic in its response to audio power, with perceived loudness running roughly as the acoustic power raised to the 1/3 (P^(1/3)); that is, a 10x increase in acoustic power is perceived as roughly a doubling in loudness. We can only detect changes of about 1 dB in power, so the 120 dB audio signal strength range from inaudibility to discomfort represents only 120 noticeable steps in volume. Companding exploits this perceptual characteristic by making the sample size, or quantization step, vary according to the signal level. At low signal levels, samples are close together. Large signals have larger steps between sample levels, resulting in fewer steps needed to cover the signal range. Typical telephony companding reduces the 12 bits needed to capture speech to 8 bits for transmission and reconstruction, lowering data rates to 64 kbits/sec.
There are two companding schemes in common use. In North America, the µ-Law compander is common in telephony; Europe uses the A-Law compander. Both are similar in their calculation.
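The µ-Law is defined as:
$$F(x) = \frac{\ln(1 + \mu x)}{\ln(1 + \mu)}$$
where µ = 255 and 0 ≤ x ≤ 1
The A-Law compander uses
$$F(x) = \frac{A x}{1 + \ln A}$$
for 0 ≤ x ≤ 1/A
$$F(x) = \frac{1 + \ln(A x)}{1 + \ln A}$$
for 1/A ≤ x ≤ 1, with A = 87.6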
Both formulae map a 12-bit input signal to an 8-bit output signal in a non-linear fashion that drops the information in loud audio signals that the ear cannot hear.
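The two compression curves can be sketched in their standard continuous form, for a normalized input magnitude between 0 and 1. Practical telephony codecs typically use piecewise-linear approximations of these curves and handle the sign bit separately. The function names here are illustrative.

```python
import math

MU = 255.0
A = 87.6

def mu_law(x):
    """mu-law compressor for a normalized magnitude 0 <= x <= 1."""
    return math.log(1 + MU * x) / math.log(1 + MU)

def mu_law_inverse(y):
    """Expander that undoes mu_law (before any quantization)."""
    return ((1 + MU) ** y - 1) / MU

def a_law(x):
    """A-law compressor: linear near zero, logarithmic above 1/A."""
    if x < 1 / A:
        return A * x / (1 + math.log(A))
    return (1 + math.log(A * x)) / (1 + math.log(A))

for x in (0.01, 0.1, 1.0):
    print(x, round(mu_law(x), 3), round(a_law(x), 3))
```

An input at 1% of full scale already maps to more than 20% of the output range with the µ-Law curve, so most of the 8-bit codes are spent on quiet signals, where the ear is most sensitive to error.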
This form of coding reflects only the ear's amplitude response. Human hearing is also logarithmic in its frequency response, gaining as much information in the octave from 50 Hz to 100 Hz as in the octave from 10 kHz to 20 kHz. Further, amplitude and frequency interact in human hearing, resulting in the masking of soft sounds by louder ones at nearby frequencies. Sophisticated perceptual coding utilizes all these attributes to reduce data rates even further and forms the basis of such audio codecs as MPEG-2 Advanced Audio Coding (AAC) and Dolby AC-3.
The general structure of a perceptual audio codec is shown in Figure 2. The incoming signal passes through a filter bank so that the energy in each frequency band can be encoded separately. Comparing the energy levels in each band to the perceptual model dictates the quantization step size needed for each band. When strong signals in one band will mask signals in adjacent bands, for instance, the quantizer does not even sample the masked signals. After quantization, lossless coding reduces the bit stream further, preparing it for channel coding and formatting. This kind of perceptual coding can significantly reduce bit rates, as shown in Table 2. The 1.4 Mbit/sec rate of CD stereo, for instance, reduces to 128 kbits/sec with no noticeable change in sound quality.
|Bit Rate||PAC Audio Quality|
|128 kbits/sec||CD Quality|
|96 kbits/sec||CD-Like Quality; Audio BW 17 Hz to 18 kHz; Dynamic range 80 dB|
|64 kbits/sec||Near-CD Quality; Audio BW 13 Hz to 15 kHz; Dynamic range 70 dB|
|48 kbits/sec||FM Radio Quality|
|24 kbits/sec||Audio BW 6 Hz to 8 kHz|
Table 2: Sound quality for perceptual audio coding
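The band-by-band quantization decision in a perceptual codec can be illustrated with a toy bit allocator. Everything in it is an assumption made for the sketch: real psychoacoustic models are far more elaborate, and the 20 dB masking margin, the 0 dB floor, and the 6 dB per bit figure are placeholder values.

```python
import math

def allocate_bits(band_db, mask_drop_db=20.0, floor_db=0.0, db_per_bit=6.0):
    """Toy allocation: bits per band from its level above a masking threshold."""
    bits = []
    for i, level in enumerate(band_db):
        # Immediate neighbours only (a hypothetical, very crude masking model)
        neighbours = band_db[max(0, i - 1):i] + band_db[i + 1:i + 2]
        # Threshold is the absolute floor or a loud neighbour minus the margin
        threshold = max([floor_db] + [n - mask_drop_db for n in neighbours])
        headroom = level - threshold
        bits.append(0 if headroom <= 0 else math.ceil(headroom / db_per_bit))
    return bits

band_levels_db = [60, 35, 50, 10, 5]   # made-up energies from a filter bank
allocation = allocate_bits(band_levels_db)
print(allocation)   # [8, 0, 6, 0, 1]
```

The second and fourth bands sit under the threshold set by a louder neighbour, so they receive no bits at all. The remaining bands get only as many bits as their level above the threshold requires.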
While perceptual coding goes a long way toward reducing data rates for music and other wideband audio, it still falls short of the compression levels needed for wireless telephony applications. In these applications, where data bandwidth is critical, the audio codec does not send sound samples in any form. Instead, the codec uses a model of human speech generation, known as linear predictive coding (LPC), as the basis of compression.
Over short intervals, about 2 to 40 milliseconds, human speech can be modeled using three parameters, as shown in Figure 3. The first parameter is a choice of sound source, either random noise or a pulse train. The noise source corresponds to the fricative sounds of consonants such as “s”, “f”, and “v”. The pulse source corresponds to voiced sounds such as vowels. Because voiced sounds may vary in pitch, the model includes a frequency for the pulse source as the second parameter. The third parameter is a set of coefficients for a recursive linear filter that models the acoustic response of the mouth, throat, and nasal passages during that moment of speech.
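The three-parameter model maps directly onto a short synthesis routine. The sketch below is illustrative only: the function name and parameters are hypothetical, the pitch is given as a period in samples rather than a frequency, and the single filter coefficient in the example is not taken from real speech.

```python
import random

def lpc_synthesize(coeffs, n_samples, voiced, pitch_period=80, seed=0):
    """Drive a recursive (all-pole) filter with a pulse train or with noise."""
    rng = random.Random(seed)
    out = []
    for n in range(n_samples):
        # Parameters 1 and 2: source type and, for voiced sounds, its pitch
        if voiced:
            excitation = 1.0 if n % pitch_period == 0 else 0.0
        else:
            excitation = rng.uniform(-1.0, 1.0)
        # Parameter 3: y[n] = x[n] + sum over k of a_k * y[n-k]
        y = excitation
        for k, a in enumerate(coeffs, start=1):
            if n - k >= 0:
                y += a * out[n - k]
        out.append(y)
    return out

frame = lpc_synthesize([0.5], n_samples=6, voiced=True, pitch_period=4)
print(frame)   # [1.0, 0.5, 0.25, 0.125, 1.0625, 0.53125]
```

Each pulse rings through the filter and decays until the next pulse arrives. With more coefficients, the filter can shape that ringing into the resonances of the vocal tract.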
This model can be used to synthesize speech, although it sounds mechanical and not human. It is also the core of code-excited linear predictive coding (CELP) as a speech audio codec. In the CELP algorithm, the codec stores a “codebook” of LPC models and compares the sound generated by the model to a segment of incoming sound, adjusting the comparison to ignore differences that the ear cannot perceive. It chooses the best match among the models, then transmits the codebook index and an error term to the receiver. Periodically, the codec will alter its codebook, requiring the occasional transmission of filter coefficients, as well.
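The codebook search at the heart of this scheme can be sketched as a nearest-match lookup. This is a simplification under stated assumptions: the codebook here holds ready-made waveforms rather than model parameters, and the error measure is plain squared error rather than the perceptually weighted comparison described above.

```python
def best_codebook_match(segment, codebook):
    """Return (index, squared error) of the codebook entry closest to segment."""
    def sq_error(candidate):
        return sum((s - c) ** 2 for s, c in zip(segment, candidate))
    index = min(range(len(codebook)), key=lambda i: sq_error(codebook[i]))
    return index, sq_error(codebook[index])

# Hypothetical three-entry codebook of four-sample waveforms
codebook = [
    [0.0, 0.0, 0.0, 0.0],
    [1.0, 0.0, -1.0, 0.0],
    [1.0, 1.0, -1.0, -1.0],
]
segment = [0.9, 0.1, -1.1, 0.0]
index, error = best_codebook_match(segment, codebook)
print(index)   # 1
```

Only the index, plus a correction term, would need to be sent for this segment, instead of the four samples themselves.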
The voice quality that results from CELP coding is somewhat less than that of other compression methods, but it still offers adequate voice fidelity for normal conversations because the error term corrects much of the mechanical sound in the LPC model. The bit-rate advantage of CELP, however, outweighs the slight loss of sound quality: CELP codecs can compress voice signals to as little as 4.8 kbits/sec.
CELP targets speech coding, not wideband audio, but a similar approach is being proposed for wideband compression in MPEG-4 structured audio. The structured audio approach transmits model parameters as in CELP, but unlike CELP the synthesis model is not pre-determined. Instead, structured audio codecs transmit a description of the model along with the parameters, allowing them to accommodate a wide range of sound sources.
But structured audio, CELP, and the other forms of audio compression are only the foundation for audio signal processing. There are many system considerations that also affect the choice of coding scheme, like the delay that the codec signal processing injects into the flow of information. In telephone conversations, for example, delays as short as a half second between speaking and being heard can completely disrupt conversation. Other system considerations include codec behavior in the presence of audio signal noise and corruption of the digital bit stream due to noise. Knowing the basic approaches, however, gives designers a starting point.