Fundamentals of Perceptual Modeling

> Assessing Quality

Until a few years ago, no international standard existed for measuring perceived audio quality, so subjective listening tests were the only widely accepted procedure for assessing audio and speech codecs.

> ITU-T P.800

Useful methods for assessing the listening quality of telephone-band systems grew out of the assessment of telephone connections and were first standardized within the ITU-T. Recommendation P.800 defines, for instance, the absolute category rating (ACR) test method, which has been used for the assessment of speech codecs since 1993. The ACR test method applies the ITU five-grade impairment scale (see Table 1 below). In the telecommunication environment, testing is done without comparison to an undistorted reference. This reflects the typical situation of a phone call, where the listener has no access to a reference such as the original voice of the other party. However, a listening test according to P.800 can still be regarded as a comparison between a test signal and a reference "in the mind" of the listener, because the listener is very familiar with the natural sound of a human voice.

To make results comparable and to allow the results of different individuals to be merged, the listeners' opinions must be anchored to an absolute scale. For this purpose, predefined examples with well-defined amounts of noise inserted by modulated noise reference units (MNRU, [ITUT810]) are presented at the beginning of a test. Each example represents a distortion corresponding to one grade of the ITU-T version of the five-grade impairment scale.

> Opinion Scale for Speech Quality Tests

Impairment    Grade
Excellent     5
Good          4
Fair          3
Poor          2
Bad           1

Table 1: The ITU-T five-grade impairment scale

Under these test conditions, a panel of typically 20 to 50 test subjects is presented with an identical series of speech fragments. Every subject is asked to score each sample on the impairment scale. After statistical processing of the individual results, a Mean Opinion Score (MOS) is calculated. With thorough setups, such test results can be reproduced quite well, even at different locations. Further tests defined by P.800 are the 'Comparison Category Rating' (CCR) and 'Degradation Category Rating' (DCR) procedures. The effort required in terms of subjects and time is tremendous, so such test methods cannot be applied in practical, day-to-day field environments.
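
The text does not spell out the statistical processing behind a MOS. As a minimal sketch, assume it is the arithmetic mean of the individual scores for one test condition, reported with a 95% confidence interval based on the normal approximation. The function name `mean_opinion_score` and the sample scores are hypothetical and serve only as illustration.

```python
import math
import statistics


def mean_opinion_score(scores):
    """Return (MOS, half-width of an approximate 95% confidence interval).

    `scores` holds one integer rating (1..5) per test subject for a single
    test condition. The 1.96 factor is the normal-approximation assumption
    made for this sketch, not something prescribed by the text.
    """
    if len(scores) < 2:
        raise ValueError("need at least two ratings")
    for score in scores:
        if score not in (1, 2, 3, 4, 5):
            raise ValueError(f"rating outside the five-grade scale: {score}")

    mos = statistics.mean(scores)
    # Sample standard deviation (n - 1 in the denominator).
    spread = statistics.stdev(scores)
    half_width = 1.96 * spread / math.sqrt(len(scores))
    return mos, half_width


# Hypothetical ratings from ten subjects for one speech fragment.
example_scores = [4, 4, 5, 3, 4, 4, 3, 5, 4, 4]
mos, ci = mean_opinion_score(example_scores)
print(f"MOS = {mos:.2f} +/- {ci:.2f}")
```

With only a handful of subjects the interval is wide, which is one reason the panels described above need 20 to 50 listeners.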

> ITU-R BS.1116

The ITU has also recommended a procedure for assessing wide-band audio codecs on the basis of subjective tests. In the past, subjective assessments of low bit rate audio codecs always targeted almost transparent quality. For this reason, the test method focuses on comparing the coded/decoded signal with the unprocessed original reference. The relevant recommendation is BS.1116, titled "Methods for the Subjective Assessment of small Impairments in Audio Systems including Multichannel Sound Systems" [ITUR1116], which was issued by the ITU-R in 1994 and updated in 1997.

The test method recommended by BS.1116 is referred to as "double-blind triple-stimulus with hidden reference". It is extremely sensitive and allows small impairments to be detected accurately. The grading scale should be treated as continuous, with "anchors" derived from the ITU-R five-grade impairment scale according to ITU-R BS.562 [ITUR562]. It is shown in Table 2.

Impairment                       Grade    SDG
Imperceptible                    5.0       0.0
Perceptible, but not annoying    4.0      -1.0
Slightly annoying                3.0      -2.0
Annoying                         2.0      -3.0
Very annoying                    1.0      -4.0

Table 2: The ITU-R five-grade impairment scale

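The analysis of the results from a subjective listening test is generally based on the Subjective Difference Grade (SDG), which is defined as the grade given to the signal under test minus the grade given to the hidden reference:

$$\mathrm{SDG} = \mathrm{Grade}_{\text{signal under test}} - \mathrm{Grade}_{\text{reference signal}}$$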

Provided that the listener correctly assigns the hidden reference signal, the SDG values will range from 0 to –4, where 0 corresponds to an imperceptible impairment and –4 to an impairment judged as very annoying. The assignment of the SDG scale is shown in the last column in Table 2 above.
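
The SDG arithmetic can be sketched as follows, assuming the usual reading that the SDG is the grade given to the signal under test minus the grade given to the hidden reference, both on the 1.0 to 5.0 scale of Table 2. The function name `subjective_difference_grade` and the example grades are hypothetical.

```python
def subjective_difference_grade(grade_test, grade_reference):
    """Return the SDG for one trial, rounded to one decimal place.

    Both arguments are grades on the continuous 1.0..5.0 scale. A positive
    result means the listener graded the hidden reference lower than the
    signal under test, i.e. the reference was not identified correctly.
    """
    for grade in (grade_test, grade_reference):
        if not 1.0 <= grade <= 5.0:
            raise ValueError(f"grade outside the 1.0..5.0 scale: {grade}")
    return round(grade_test - grade_reference, 1)


# Listener identifies the hidden reference (graded 5.0) and grades the
# coded signal 3.4: the SDG is negative.
print(subjective_difference_grade(3.4, 5.0))

# Listener mistakes the reference for the coded signal: the SDG is positive.
print(subjective_difference_grade(5.0, 4.2))
```

How trials with a positive SDG are handled is a matter for the statistical analysis of the test and is not covered by this sketch.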

In contrast to the listening test according to ITU-T P.800, BS.1116 requires an explicit comparison between the test signal and a reference signal, since the listener never knows what the original (music) signal would sound like. This method was applied in a variety of international verification tests in the past. Keep in mind, however, that because of its scope within the ITU-R it can be applied to small impairments only, which in practice limits it to almost "transparent" studio quality. Another issue that has been discussed among experts is the recommendation to use the scale at a resolution of one decimal place, which results in 41 (!) discrete steps. There are indications that this is too fine a choice for some subjects, and that the meaning of the impairment anchors is interpreted differently from subject to subject [SPOR96].

> ITU-R BS.1534 "MUSHRA"

Because BS.1116 is restricted to small impairments, other methods are needed to assess quality at very low bit rates (i.e. with large impairments). The methods according to ITU-T P.800 were adopted for some assessments to bridge the "gap" left by the lack of a suitable recommendation for testing significantly impaired wide-band audio signals. While in principle they seem better suited to impaired music signals than the BS.1116 method, their use for very low bit rate audio coding applications remains questionable, as there are no clearly defined example distortions in such a case. The scale was derived from telephone speech quality and is not well defined when translated to music coding. The results may therefore depend significantly on the subjective interpretation of the impairment levels. Consequently, an EBU expert group proposed an advanced listening test procedure known as "MUSHRA", which stands for "MUltiple Stimuli with Hidden Reference and Anchor". The new method targets significantly impaired audio signals, such as those produced at very low bit rates. MUSHRA was adopted as an international recommendation by ITU-R Working Party 6Q in 2001. Although first experience has been gained, MUSHRA also exhibits some major drawbacks that make it very difficult, if not impossible, to derive an 'absolute' quality metric of the kind one would expect from a measurement. Still, perceptual measurements agree well with MUSHRA results if properly correlated.
