Phoneme library
The construction of the phoneme library follows the ideas described in [Cam85]. Utterances containing the diphones of interest were recorded, and the segments containing the diphones were selected.
Since Portuguese is essentially a syllabic language, the majority of the diphones are of the form <consonant> <vowel>. For each diphone, one word containing it was selected according to the following criteria, adopted to minimize coarticulation effects:
- the utterance must be an existing word in the language;
- the diphone must be the stressed syllable of the word;
- the diphone must not be the first or last syllable of the word;
- the consonant must be preceded by a vowel;
- the syllable following the diphone should start with a plosive, if at all possible.
These criteria allow an easy isolation of the diphone and keep coarticulation effects from interfering.
It was found that it is very important to use real words, since isolated syllables are uttered in a quite different and unnatural way. With meaningless words the situation is a little better, but they should be avoided for the same reason.
Stressed syllables were selected because they are usually uttered with far more clarity than the other syllables in the word. We do not know of formal studies about this fact, but it seems to be fairly general, and not a specific property of spoken Portuguese.
The word chosen for recording each diphone was also required to have a plosive sound after the selected diphone, to simplify the identification and delimitation of its last frame.
To improve the characterization of individual consonants, it was further required that the consonant be bounded by vowels. Vocalic sounds are easily recognized in the spectrogram, and they define the limits of the consonant clearly and unambiguously.
Words containing the semivowels (/l/ and /r/ in Portuguese) were selected following essentially the same criteria, and included in the library.
The steps required for each phoneme/diphone were:
- recording of the word;
- analysis by the standard CELP algorithm;
- isolation of the phoneme/diphone by visual inspection of a spectrogram generated from the LSP parameters;
- standardization of the duration to 195 ms (26 frames of 7.5 ms each);
- synthesis of the selected phoneme/diphone, for checking purposes;
- inclusion in the library.
A further step is required after the library is assembled. Since the samples were taken from different contexts, they might have quite different intensities, and two normalizations are required. The first is intra-vowel intensity normalization.
At least in Portuguese, the intensity of a diphone depends essentially on the vowel intensity. So, diphones of the form <consonant> <vowel> should have approximately the same intensity for the same vowel. All the diphones containing the same vowel were normalized to the same intensity, by multiplying the stochastic gain by a suitable value. This value can be, for instance, the ratio of the desired normalized value to the average original intensity.
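The intra-vowel normalization can be illustrated with a minimal sketch. It assumes that each library entry carries a measured average intensity, and that this intensity scales linearly with the stochastic gain. The dictionary layout, the field names and the function `normalize_intra_vowel` are hypothetical, introduced only for illustration.

```python
def normalize_intra_vowel(diphones, target):
    """Scale stochastic gains so each diphone reaches the target intensity of its vowel.

    diphones: list of dicts with hypothetical keys 'vowel', 'intensity'
              (measured average intensity) and 'stochastic_gains' (one per frame).
    target:   dict mapping each vowel to its desired normalized intensity.
    Assumption: intensity is proportional to the stochastic gain.
    """
    normalized = []
    for d in diphones:
        # Multiplier = desired intensity / original average intensity.
        factor = target[d["vowel"]] / d["intensity"]
        normalized.append({
            **d,
            "intensity": d["intensity"] * factor,
            "stochastic_gains": [g * factor for g in d["stochastic_gains"]],
        })
    return normalized


# Toy data: two diphones sharing the vowel /a/, recorded at different levels.
library = [
    {"name": "pa", "vowel": "a", "intensity": 2.0, "stochastic_gains": [1.0, 2.0, 3.0]},
    {"name": "ta", "vowel": "a", "intensity": 4.0, "stochastic_gains": [2.0, 4.0, 6.0]},
]
result = normalize_intra_vowel(library, {"a": 4.0})
```

After the call, both entries have the same intensity, and the quieter diphone has had its gains doubled.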
It is further required to normalize inter-vowel intensities. That normalization is essentially subjective. For each pair of vowels, diphones with the same consonant are chosen; one is kept constant, and the intensity of the second is varied until both are perceived with the same subjective intensity. All the diphones containing the second vowel then have their intensity multiplied by the resulting ratio. The procedure is repeated for all the vowels present in the library.
Interpolation techniques
Once the diphone library is available, the next problem is the composition of a given sequence of phonemes. The following techniques are heuristics that offer good results for the Portuguese language, and can be considered a starting point for developing a similar system for other languages.
In this section the synthesis will be considered left to right, i.e., always between an already processed sequence and an arriving diphone.
The Linear Prediction model generates voice by the excitation of a filter by an appropriate signal. Accordingly, the following discussion will show the interaction, at the phoneme boundaries, of both the parameters that represent the filter (LSP parameters) and the excitation parameters (CELP parameters).
There are three possibilities, depending on the nature of the incoming phoneme/diphone:
- the diphone starts with a plosive consonant;
- the diphone starts with another kind of consonant;
- the phoneme is a vowel.
The simplest case is a diphone starting with a plosive. Since there is an interval of silence separating the diphone from the preceding sound, there are no significant coarticulation effects, and no interpolation is necessary. The silence period preceding the start of the plosive sound is fixed at 30 ms. Although this interval varies in real speech, the adoption of this value avoids coarticulation effects.
Initially, a plain silence interval was inserted. Although the speech was of good quality, some listeners found the result subjectively strange. To improve the synthesis quality, another strategy was tested, which removed that unpleasantness: the LSP coefficients were interpolated between the last interval of the previous diphone and the diphone being introduced, the pitch of the last interval of the previous diphone was maintained, and the intensity was strongly reduced by using an adaptive gain well below 1.
Encounters involving a diphone starting with a non-plosive consonant are subject to strong coarticulation effects. According to [Sha87], coarticulation tends to be a phenomenon that precedes the second diphone, as if the vocal tract were preparing for the new geometry required for the incoming phoneme, using the last and less important part of the preceding diphone.
To simulate this effect, the LSP parameters were interpolated over a variable interval involving mainly the last part of the previous diphone. The best results, based on subjective sound quality, were obtained by using a 60 ms interval, starting 45 ms before the end of the previous phoneme. The LSP parameters were linearly interpolated; the CELP parameters did not require any processing.
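A minimal sketch of this boundary interpolation follows. With the 7.5 ms frame length used for the library, the 60 ms interval corresponds to 8 frames: 6 at the end of the previous diphone and 2 at the start of the incoming one. The choice of endpoints (the last frame before the interval and the first frame after it) is an assumption, since the text does not specify them, and the function name `interpolate_boundary` is hypothetical.

```python
FRAME_MS = 7.5  # frame length stated for the library (26 frames = 195 ms)


def interpolate_boundary(prev, nxt, window_ms=60.0, before_ms=45.0):
    """Join two diphones, linearly interpolating LSP frames across the boundary.

    prev, nxt: lists of frames, each frame a list of LSP values.
    Assumption: the interpolation runs between the last frame before the
    window and the first frame after it.
    """
    n_total = round(window_ms / FRAME_MS)    # 8 frames in the window
    n_before = round(before_ms / FRAME_MS)   # 6 frames replaced in prev
    n_after = n_total - n_before             # 2 frames replaced in nxt
    if len(prev) <= n_before or len(nxt) <= n_after:
        raise ValueError("diphones too short for the requested window")

    start = prev[-n_before - 1]  # last untouched frame of the previous diphone
    end = nxt[n_after]           # first untouched frame of the incoming diphone
    bridge = []
    for k in range(1, n_total + 1):
        t = k / (n_total + 1)    # goes from just above 0 to just below 1
        bridge.append([(1 - t) * a + t * b for a, b in zip(start, end)])
    return prev[:-n_before] + bridge + nxt[n_after:]


# Toy example with a single LSP value per frame.
prev = [[0.0] for _ in range(10)]
nxt = [[9.0] for _ in range(10)]
joined = interpolate_boundary(prev, nxt)
```

Because each interpolated frame is a convex combination of two frames, the increasing order of the LSP values within a frame is preserved whenever both endpoint frames are ordered.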
The solution was simpler than expected.
Processing of vocalic encounters is critical in Portuguese. Although the crudest concatenation generates understandable utterances, even small deviations from the natural sounds are easily identifiable and quite annoying.
Sharp transitions are rare; usually, diphthongs are generated. As diphthongs correspond to slow transitions of the vocal tract, the resonance aspect can be easily simulated by interpolation of the LSP parameters. It was found experimentally that a duration of around 160 ms, symmetrically distributed between the two phonemes, offered the best results for vowel-vowel encounters, and 80 ms for encounters involving semivowels.
Processing of the CELP parameters is somewhat more complex. The profile of the CELP parameters of a diphthong has the same general shape as that of a single vowel: the adaptive index tends to remain constant, the adaptive gain tends to 1, and the stochastic gain decreases. This parameter profile must be generated over the total length of the synthesized diphthong, starting at 1.0 and going to 0.75. Figure 1 is a plot of the adaptive gain for a natural utterance containing three diphthongs, showing the regularity of the pattern.
This is not enough, however. The subjective intensity of each vowel is different, and the direct application of the above rule must be changed to take that difference into account. The value of the adaptive gain must be changed to the inverse ratio of subjective intensity of each vowel.
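Neither duration is a whole number of 7.5 ms frames, so an implementation has to round. The sketch below assumes rounding to the nearest frame on each side of the boundary, which the text does not specify; the function name `symmetric_window` is hypothetical.

```python
FRAME_MS = 7.5  # frame length stated for the library


def symmetric_window(duration_ms):
    """Return (frames_per_side, actual_ms) for a window centred on the boundary.

    Assumption: each half of the window is rounded to the nearest frame.
    """
    per_side = round(duration_ms / 2 / FRAME_MS)
    return per_side, 2 * per_side * FRAME_MS


vowel_window = symmetric_window(160)      # vowel-vowel encounters
semivowel_window = symmetric_window(80)   # encounters involving semivowels
```

Under this rounding the vowel-vowel window spans 11 frames on each side (165 ms in total) and the semivowel window 5 frames on each side (75 ms), both close to the reported values.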
Conclusions
CELP coding techniques offer a strong basis for speech synthesizer systems. The quality of the synthetic voice is far superior to that of other techniques using LPA. The algorithms are relatively simple and straightforward.
CELP is only one of a family of analysis-by-synthesis systems; since they differ mainly in coding details, it is expected that any of them will be a suitable basis for developing voice synthesizers.
The experiments were conducted for the Portuguese language, which has very few consonant encounters to be synthesized. The results suggest that the technique can be easily extended to other languages.
References
- Cam85 Campos, Geraldo L. "Speech synthesis for unrestricted vocabulary: the phoneme-diphone approach", Proceedings of the Speech Tech '85, New York, New York, 307-309, 1985.
- Cam89 Campbell, J., Welch, V. and Tremain, T. "An Expandable Error-Protected 4800 bps CELP Coder", Proc. of ICASSP, 735-738, 1989.
- Cam95 Campos, Geraldo L. and Chbane, Dimas T. "A Finite Automaton Translator of Text to Phonemes for the Portuguese Language", Annals of the XIIIth International Congress of Phonetic Sciences, 28.2-8, 1995.
- Cox89 Cox, R., Kleijn, W. and Kroon, P. "Robust CELP Coders for Noisy Background and Noisy Channels", Proc. of ICASSP, 739-742, 1989.
- Det92 Details to Assist in Implementation of Federal Standard 1016 CELP. Office of the Manager, National Communications System, Jan. 1992.
- Fed91 Federal Standard 1016 - Telecommunications: Analog to Digital Conversion of Radio Voice by 4800 bit/second Code Excited Linear Prediction (CELP), National Communications System, Office of Technology & Standards, Washington, DC 20305-2010, Feb. 1991.
- Lin86 Lin, D. "New Approaches to Stochastic Coding of Speech Sources at Very Low Bit Rates", Signal Processing III: Theories and Applications, Proc. of EUSIPCO-86, 445-448, 1986.
- Sha87 O'Shaughnessy, D. "Speech communication: human and machine", Addison-Wesley, New York, 1987.