
Is Continuous Speech a Series of Interconnected Phonemes

Continuous Speech

Speech-Based Interfaces

Sadaoki Furui, in Text Entry Systems, 2007

Continuous Word Scoring

In continuous speech recognition, the output words are generally not time synchronous with the input utterance. Therefore, the output stream has to be aligned with the reference transcriptions. This means that classifications such as substitutions, deletions, correct words, and insertions can no longer be identified with complete certainty. The actual measurement of these quantities through alignment is difficult. The alignment process uses a dynamic programming algorithm to minimize the misalignment of two strings of words (symbols): the reference sentence and the recognized sentence. The alignment depends on the relative weights of the contributions of the three types of errors: substitutions, insertions, and deletions. Hunt (1990) discussed the theory of word–symbol alignment and analyzed several experiments on alignment. Usually, the three types of errors have equal weights. Depending on the application, one can assign different weights to the various kinds of errors.

Thus, the total number of errors is the sum of the three types of errors,

$N_E = N_S + N_I + N_D,$

where $N_S$, $N_I$, and $N_D$ are the numbers of substitutions, insertions, and deletions, respectively, and $N$ is the total number of words in the reference transcription. The error rate is therefore

(8.13) $E = \frac{N_E}{N} = \frac{N_S + N_I + N_D}{N}.$

Note that this error measure can become larger than 1 in cases of extremely bad recognition. Often, one defines the accuracy of a system as

(8.14) $A = 1 - E = \frac{N - N_S - N_I - N_D}{N}.$

Note that the accuracy is not just the fraction C of words correctly recognized, because the latter does not include insertions.
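The error counts above come out of the dynamic-programming alignment itself. The following minimal sketch (not the chapter's code; the function name and sample strings are invented) aligns a hypothesis against a reference with equal weights for the three error types and then derives E and A as in Eqs. (8.13) and (8.14).

```python
def word_error_counts(reference, hypothesis):
    """Align two word lists by dynamic programming and count
    substitutions (S), insertions (I), and deletions (D)."""
    R, H = len(reference), len(hypothesis)
    # cost[i][j] = minimal edit cost aligning reference[:i] with hypothesis[:j]
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i                          # all deletions
    for j in range(1, H + 1):
        cost[0][j] = j                          # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            cost[i][j] = min(sub,               # match or substitution
                             cost[i - 1][j] + 1,    # deletion
                             cost[i][j - 1] + 1)    # insertion
    # Trace back through the cost matrix to classify each error.
    i, j, S, I, D = R, H, 0, 0, 0
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1]):
            S += reference[i - 1] != hypothesis[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            D += 1
            i -= 1
        else:
            I += 1
            j -= 1
    return S, I, D

ref = "the cat sat on the mat".split()
hyp = "the cat sat on mat mat".split()
S, I, D = word_error_counts(ref, hyp)
E = (S + I + D) / len(ref)      # Eq. (8.13)
A = 1 - E                       # Eq. (8.14)
```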


URL:

https://www.sciencedirect.com/science/article/pii/B9780123735911500085

Speech Recognition

John-Paul Hosom, in Encyclopedia of Information Systems, 2003

VII.G. Continuous Speech Recognition

The recognition of continuous speech is, at least in theory, a simple extension of the case of recognizing discrete words using phoneme models. For recognizing continuous speech, each word in the vocabulary is specified in the same way as with isolated word recognition. The key difference is that the transitions out of each word-final state point not only to the same state and a silence model, but also to the word-initial states of all words in the vocabulary. Conceptually, then, when a word-final state is exited, the next state may be the silence model, or it may be the beginning of any other word. In implementation, several issues arise with regard to controlling the complexity of the resulting very large HMM and keeping track of word-level information. One approach to the complexity issue is to use "null" states at the beginning and ending of each phoneme HMM. These null states do not generate or record any observations, and can be entered and exited during the same time frame. The advantage of null states is that the word-end states then do not require separate transitions to each word-beginning state, but a single transition to the null state. The null states can then be easily connected from word endings to word beginnings.
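As a rough illustration of the null-state bookkeeping (not taken from the encyclopedia article; the vocabulary and state names are invented), the sketch below builds the inter-word transition list for a small word loop, first with direct word-to-word arcs and then through a single non-emitting null node, showing how the number of arcs drops from V×V to roughly 2V.

```python
vocabulary = ["yes", "no", "maybe", "stop"]   # hypothetical word list

# Direct connection: every word-final state points to every word-initial state.
direct_arcs = [(f"{w}_final", f"{v}_initial")
               for w in vocabulary for v in vocabulary]

# Null-state connection: word-final states feed one non-emitting node, which
# fans out to all word-initial states (and to a silence model).
null_arcs = ([(f"{w}_final", "NULL") for w in vocabulary] +
             [("NULL", f"{v}_initial") for v in vocabulary] +
             [("NULL", "silence"), ("silence", "NULL")])

print(len(direct_arcs))   # V*V = 16 arcs
print(len(null_arcs))     # 2*V + 2 = 10 arcs; the gap widens rapidly with vocabulary size
```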


URL:

https://www.sciencedirect.com/science/article/pii/B0122272404001647

A Real-Time Speech Emotion Recognition System and its Application in Online Learning

Ling Cen, ... Fengye Hu, in Emotions, Technology, Design, and Learning, 2016

Real-time Recording and Recognition

In this experiment, continuous speech spanning 8.2 min was recorded, with four emotions expressed alternately during the recording, each lasting around 2 min, with an interval of 30 s. The dataset recorded in the previous offline experiment was used to train the learning model. The 8.2-minute recording was divided into 68 segments: 19 with the neutral state, 18 with happiness, 17 with anger, and 14 with sadness. The results are shown in Table 2.2. It can be seen from Table 2.2 that the accuracy for the anger class is slightly higher than that in the offline testing. The accuracies achieved in the other categories are at least 80%, except for the sad state. The average accuracy across all emotion categories is 78.78%. With the same training model, the average performance in the real-time experiment is lower than that based on the pre-recorded audio data, which was partially caused by inaccuracy in speech detection and segmentation, sound variation across recordings, and differing background noise. However, the average accuracy of 78.78% indicates that our system can work properly in real-time applications.

Table 2.2. Confusion matrix in the real-time experiment (rows: actual emotion; columns: predicted emotion; values in %)

Actual    Neutral  Happy  Angry  Sad
Neutral   80.00    10.00  10.00  0.00
Happy     0.00     87.50  12.50  0.00
Angry     5.55     11.11  83.33  0.00
Sad       7.14     14.28  14.28  64.28
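As a quick check of the figures quoted in the text, the per-class accuracies are the diagonal entries of Table 2.2 and the reported average is their unweighted mean; the short sketch below (an illustration added here, not the authors' code) reproduces that arithmetic.

```python
import numpy as np

labels = ["Neutral", "Happy", "Angry", "Sad"]
# Rows: actual class, columns: predicted class, values in percent (Table 2.2).
confusion = np.array([
    [80.00, 10.00, 10.00,  0.00],
    [ 0.00, 87.50, 12.50,  0.00],
    [ 5.55, 11.11, 83.33,  0.00],
    [ 7.14, 14.28, 14.28, 64.28],
])

per_class_accuracy = np.diag(confusion)        # [80.0, 87.5, 83.33, 64.28]
average_accuracy = per_class_accuracy.mean()   # ~78.78, as reported in the text
print(dict(zip(labels, per_class_accuracy)), round(average_accuracy, 2))
```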

Figure 2.5 depicts a screenshot of the system output in the real-time recording and recognition experiment. The statistics of the detected emotions are shown. In the actual recording, the frequency of each emotion is 25%, while the frequencies in the classification results vary around 25%, deviating a little from the real labels. To cater to the requirements of some applications, we also counted the statistics for the positive class, which contains happiness, and the negative class, which includes anger and sadness, as shown in Figure 2.5. Together with the emotion frequency statistics, the duration of the recording is displayed.

Figure 2.5. Screenshot of emotion recognition results with real-time recording.


URL:

https://www.sciencedirect.com/science/article/pii/B9780128018569000025

Speech Summarization for Tamil Language

A. NithyaKalyani, S. Jothilakshmi, in Intelligent Speech Signal Processing, 2019

7.7.2.1 Related Work on Tamil Speech Recognition

Lakshmi et al. [41] proposed a syllable-based continuous speech recognition system. Here, a group delay-based segmentation algorithm is used to segment the speech signal in both the training and testing processes, and the syllable boundaries are identified. In the training process, a rule-based text segmentation method is used to divide the transcripts into syllables. The syllabified text and signal are then used to annotate the spoken data. In the testing phase, the syllable boundary information is collected and mapped to the trained features. The error rate is reduced by 20% when using the group delay-based syllable segmentation approach, and so the recognition accuracy is improved.

Radha et al. [42] propose an automatic Tamil speech recognition system that uses a multilayer feed-forward neural network to increase the recognition rate. The system eliminates the noise present in the input audio signal by applying preemphasis, median, average, and Butterworth filters. Then the linear predictive cepstral coefficient features are extracted from the preprocessed signal. The extracted features are classified by the multilayer feed-forward network, which classifies Tamil speech efficiently. The performance of the system is analyzed with the help of experimental results, which show a reduced error rate and increased recognition accuracy.
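A rough sketch of such a filter-then-classify pipeline is given below. It is only an illustration under stated assumptions, not the system of Radha et al.: it uses MFCCs (via librosa) as a stand-in for the LPCC features described above, a first-order preemphasis filter, and scikit-learn's multilayer perceptron, and the training-data format is hypothetical.

```python
import numpy as np
import librosa
from scipy.signal import medfilt, butter, lfilter
from sklearn.neural_network import MLPClassifier

def preprocess(signal, sr, preemph=0.97):
    """Simple noise reduction in the spirit of the description above."""
    signal = np.append(signal[0], signal[1:] - preemph * signal[:-1])  # preemphasis
    signal = medfilt(signal, kernel_size=3)                            # median filter
    b, a = butter(4, 3400 / (sr / 2), btype="low")                     # Butterworth low-pass
    return lfilter(b, a, signal)

def features(signal, sr):
    """Utterance-level feature vector (mean MFCCs as an LPCC stand-in)."""
    mfcc = librosa.feature.mfcc(y=preprocess(signal, sr), sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

# Hypothetical training data: (waveform, sample_rate, word_label) triples.
def train(utterances):
    X = np.stack([features(sig, sr) for sig, sr, _ in utterances])
    y = [label for _, _, label in utterances]
    clf = MLPClassifier(hidden_layer_sizes=(64,), max_iter=2000)
    return clf.fit(X, y)
```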

Graves et al. [43] recognize speech by applying deep recurrent neural networks, which work well for sequential data. The networks are based on long short-term memory (LSTM) units to model the dependencies between speech features, and the outputs are decoded with the connectionist temporal classification process. The implemented system achieves a phoneme error rate of 17.7% on the TIMIT phoneme recognition benchmark.

Gales and Young [44] describe large vocabulary continuous speech recognition systems for improving the recognition rate. The authors reduce the assumptions made about particular speech features, which are classified by applying the hidden Markov model. This framework uses various processes such as feature projection, discriminative parameter estimation, covariance modeling, adaptation, normalization, multipass decoding, and noise compensation while modeling the speech features. The described system reduces the error rate, and so the recognition rate is increased in an effective manner. Table 7.5 compares the performance of various speech recognition techniques and shows that the modified group delay function with the Gammatone wavelet coefficient approach yields the best recognition accuracy.

Table 7.5. Comparison of Speech Recognition Techniques

Recognition Technique — Recognition Accuracy (%)
MFCC with HMM [45] — 85
Mel-frequency cepstral coefficients (MFCC) with deep neural network (DNN) [46] — 82.2
Gammatone cepstral coefficients (GTCC) with hidden Markov model (HMM) [47] — 85.6
Gammatone cepstral coefficients (GTCC) with deep neural network (DNN) [48] — 88.32
Modified group delay function (MGDF) with Gammatone wavelet coefficient approach [49] — 98.3
Syllable-based continuous approach [41] — 80


URL:

https://www.sciencedirect.com/science/article/pii/B9780128181300000076

Using SONNET 1 to Segment Continuous Sequences of Items

Albert Nigrin, in Neural Networks and Pattern Recognition, 1998

8.2 The Classification of Embedded Patterns

An additional property that classifying networks should achieve is the ability to classify patterns that are surrounded by extraneous information. This is essential in areas such as continuous speech, where there are usually no clear-cut boundaries between words. One way a network can deal with the extraneous information is to use both inhomogeneous nodes and a nonuniform pattern of connectivity between the nodes. In SONNET 1, the nodes evolve to have different input/output characteristics, and the connectivity pattern evolves so that nodes inhibit only other nodes that classify similar patterns.

One possible justification for the necessity of inhomogeneous nodes concerns the predictive power of the classifying nodes and has been discussed elsewhere (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). Another justification arises if we analyze the structure that a network must have if it is to satisfy two simple constraints: (1) the network should be able to classify patterns that are surrounded by extraneous information; and (2) the network should be able to make clear-cut decisions.

For example, suppose some F(2) cell x_CAR represents the pattern CAR. The first constraint implies that x_CAR should receive the full input that is possible for it, even when additional items like I or S are present in an input pattern like CARIS. Otherwise, if the presence of extraneous items reduced the input to x_CAR significantly, then x_CAR would not be able to activate when its pattern was embedded in larger patterns (as is often the case in speech signals).

The second constraint implies that when multiple classifications are competing for an input pattern, then the network should choose whichever cell best represents the pattern and allow that cell to fully activate, while suppressing the activity of other cells. For example, if CAR is presented to a network that has the classifications x_CAR and x_CARGO, then x_CAR should fully activate and x_CARGO should be suppressed, even though x_CARGO partially represents the input pattern. Conversely, when CARGO is presented, x_CARGO should fully activate and x_CAR should be suppressed. This is true even though the pattern that x_CAR represents is entirely present, and therefore (by the first constraint) x_CAR must receive the full input that is possible for it!

To allow a single network to be able to satisfy both constraints simultaneously, it must have some kind of inhomogeneity in the structure of its classifying cells. One possible inhomogeneity that solves the problem involves the use of different cell "sizes," with larger cells classifying larger patterns and smaller cells classifying smaller patterns. Larger cells dilute their input (both excitatory and inhibitory) to a greater degree than do smaller cells. Thus, they are difficult to turn on, and they respond well only to larger patterns. However, once the larger cells are activated, they are difficult to turn off, and thus they inhibit smaller cells more easily than the reverse. For example, when the word CAR is presented, x_CARGO does not receive enough input to activate, thus allowing x_CAR to activate. However, when the word CARGO is presented, the node x_CARGO receives enough input to activate, and through unequal competition it can suppress (mask out) the activity of x_CAR.
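The size-based dilution and unequal competition described above can be conveyed with a deliberately simplified numeric toy. This is not the SONNET 1 model: the sizes, threshold, and masking rule below are invented purely to illustrate the idea that a large cell is hard to turn on but, once on, masks overlapping smaller cells.

```python
# Toy illustration of size-diluted excitation and unequal competition.
# Parameters (sizes, threshold) are invented for the example.

cells = {"x_CAR": set("CAR"), "x_CARGO": set("CARGO")}
THRESHOLD = 0.9   # a cell turns on only when nearly all of its pattern is present

def classify(items):
    items = set(items)
    # Dilution: excitation is the matched input divided by the cell's "size".
    excitation = {name: len(pattern & items) / len(pattern)
                  for name, pattern in cells.items()}
    active = {n for n, e in excitation.items() if e >= THRESHOLD}
    # Unequal competition: an active larger cell masks overlapping smaller cells.
    winners = {n for n in active
               if not any(len(cells[m]) > len(cells[n]) and cells[m] & cells[n]
                          for m in active)}
    return excitation, winners

print(classify("CAR"))    # x_CARGO is diluted (0.6) and stays off; x_CAR wins
print(classify("CARGO"))  # both fully excited, but x_CARGO masks x_CAR
```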

A second reason to prefer inhomogeneous nodes is called the temporal chunking problem (Grossberg, 1982, 1987b, 1988). Suppose that some pattern ABCD is presented at F(1). Furthermore, suppose that all the subparts of that pattern already exist as classifications, so that different F(2) nodes already code the patterns A, B, C, and D. If the F(2) nodes were homogeneous, then the pattern ABCD would continually be processed as subparts instead of eventually being treated as a unified whole. (A more realistic example occurs when the network should learn the word CARGO, even after it has established categories for CAR and GO.) To prevent this, there must be some mechanism that favors the formation of larger categories.

A second area of nonuniformity in the structure of the classifying field concerns the inhibitory connections within the field (Cohen and Grossberg, 1986, 1987; Marshall, 1990a, 1990b, 1992, 1995; Nigrin, 1990, 1993). In SONNET 1, nodes compete only with other nodes that attempt to classify similar patterns. This nonuniformity increases the power of the network, as the following example shows. Suppose that the lists AB, CD, and ABC have been learned. (Consider these lists to be abstractions for the spoken words ALL, ALTER, and TURN.) When ABC is presented, x_ABC should activate and x_AB and x_CD should be inhibited. However, when ABCD is presented, the reverse should be true. The list should be segmented as AB and CD, with x_ABC inhibited, since it is not part of the segmentation.

This will not happen if the connections are homogeneous. Since x_ABC must activate whenever ABC is presented, it must be true that neither x_AB nor x_CD can individually suppress the activity of x_ABC. When ABCD is presented, only by combining inhibition can x_AB and x_CD possibly mask out x_ABC. However, if the connections are uniform, then x_AB and x_CD will inhibit each other as much as they inhibit x_ABC. Consequently, x_ABC will activate, even for ABCD.

To remedy this, F(2) nodes should inhibit only other nodes in F(2) that respond to similar patterns, thus allowing multiple smaller nodes to combine and overpower larger ones. In the example above, x_ABC should compete with both x_AB and x_CD, but x_AB and x_CD should not compete with one another. (Another advantage of using nonuniform connections is that it allows a network to classify multiple patterns simultaneously. This is a great advantage when a network is forced to operate in complex, unsegmented environments.)


URL:

https://www.sciencedirect.com/science/article/pii/B9780125264204500094

Introduction

Edmund Lai PhD, BEng, in Practical Digital Signal Processing, 2003

1.4.1.3 Speech recognition

One of the major goals of speech recognition is to provide an alternative interface between human user and machine. Speech recognition systems can either be speaker dependent or independent, and they can either accept isolated utterances or continuous speech. Each system is capable of handling a certain vocabulary.

The basic approach to speech recognition is to extract features of the speech signals in the training phase. In the recognition phase, the features extracted from the incoming signal are compared to those that have been stored. Owing to the fact that our voices change with time and the rate at which we speak also varies, speech recognition is a very tough problem. However, some relatively simple small-vocabulary, isolated-utterance recognition systems are now commercially available. This has come about after 30 years of research and the advances made in DSP hardware and software.
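One classical way to realize this store-and-compare scheme for isolated utterances is template matching with dynamic time warping (DTW), which tolerates variation in speaking rate. The sketch below is a generic illustration under assumed inputs (frame-wise feature sequences such as MFCCs; the random arrays merely stand in for real features), not code from this chapter.

```python
import numpy as np

def dtw_distance(a, b):
    """Dynamic time warping distance between two feature sequences
    (arrays of shape [frames, features]); tolerates differences in speaking rate."""
    n, m = len(a), len(b)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = np.linalg.norm(a[i - 1] - b[j - 1])   # local frame distance
            D[i, j] = cost + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m]

def recognize(utterance, templates):
    """Return the label of the stored template closest to the incoming utterance."""
    return min(templates, key=lambda label: dtw_distance(utterance, templates[label]))

# Hypothetical usage with random placeholders standing in for real feature sequences.
rng = np.random.default_rng(0)
templates = {"yes": rng.normal(size=(40, 13)), "no": rng.normal(size=(35, 13))}
print(recognize(rng.normal(size=(42, 13)), templates))
```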


URL:

https://www.sciencedirect.com/science/article/pii/B9780750657983500011

The Synchronous Approach to Reactive and Real-Time Systems

ALBERT BENVENISTE, ... GÉRARD BERRY, in Readings in Hardware/Software Co-Design, 2002

C A Second Case Study: Speech Recognition Systems

Speech recognition systems do not bear hard real-time constraints: the time response between the input (spoken language) and the output (text on screen or input to some other system) may be only loosely constrained. Nevertheless, the continuous speech signal must be processed on-line to avoid unbounded buffering. Hence, continuous speech recognition is a good prototype of an application where high-speed numerical preprocessing as well as complex symbolic postprocessing is required. Similar examples are found in data communication, pattern recognition, military systems, process monitoring, and troubleshooting systems. We describe here briefly the speech-to-phoneme recognition system developed at IRISA [7]. Its overall organization is shown in Fig. 1. The originality of this system lies in its use of a segmentation of the continuous speech signal prior to any recognition. The automaton supervises the segmentation; it fires small modules to compute cepstra, a representation of the spectral characteristics of the signal, associated with detected segments, as well as some acoustic/phonetic cues. All these modules are numerically oriented. Finally, high-level processing is performed following a technique close to hidden Markov model (HMM) methods [22]: maximum likelihood decoding based on a stochastic automaton. This is again a numerically as well as logically oriented module.

Fig. 1. A speech-to-phoneme recognition system.

To illustrate further how signal processing algorithms may give rise to reactive systems, let us give additional details on the segmentation module. The outcome of this processing is shown in Fig. 2. The segmentation procedure is mainly numerically oriented and is performed on-line. Detection of a change occurs with a bounded delay, so the speech signal must be reprocessed from the estimated change time. Furthermore, some local backward processing of the speech signal is also needed. Hence, while this is still real-time processing of the speech signal, its timing is far from trivial. Therefore, writing a real-time implementation of this processing in C or Fortran is a tedious and error-prone task.

Fig. 2. The segmentation module. The detected segments are superimposed on the signal (top line). Subsequent lines show the behavior of several auxiliary quantities (the divergence tests) that are computed on-line to perform the segmentation. As a by-product of the processing, the auxiliary labels "v" and "nv" indicate voiced and unvoiced segments, respectively.
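To give a flavor of the divergence-style tests mentioned in Fig. 2, here is a minimal on-line change detector in the CUSUM spirit. It is a simplification introduced here, not the IRISA algorithm: it accumulates evidence that the short-term statistics have drifted from a reference level and reports the change with a bounded delay, after which processing would resume from the estimated change time.

```python
import numpy as np

def cusum_segmenter(frame_energies, drift=0.5, threshold=8.0):
    """Very simplified on-line change detection on a stream of frame energies.
    Yields the estimated frame index of each detected change."""
    reference = frame_energies[0]   # running estimate of the pre-change level
    score, n_seen, change_start = 0.0, 1, 0
    for t, x in enumerate(frame_energies[1:], start=1):
        score = max(0.0, score + (x - reference - drift))  # one-sided CUSUM statistic
        if score == 0.0:
            change_start = t        # statistic reset: no evidence of change so far
        if score > threshold:
            yield change_start      # estimated change time (earlier than detection time t)
            reference, score, n_seen = x, 0.0, 1
        else:
            n_seen += 1
            reference += (x - reference) / n_seen   # update pre-change estimate

# Hypothetical stream: low-energy frames followed by a louder (e.g., voiced) segment.
energies = np.concatenate([np.random.default_rng(1).normal(1.0, 0.2, 50),
                           np.random.default_rng(2).normal(4.0, 0.2, 50)])
print(list(cusum_segmenter(energies)))   # detects one boundary near frame 50
```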

To summarize, this example is a good prototype of a complex real-time signal processing application. It may be compared to radar systems for example.


URL:

https://www.sciencedirect.com/science/article/pii/B9781558607026500132

Cognitive Computing: Theory and Applications

S. Jothilakshmi, V.N. Gudivada, in Handbook of Statistics, 2016

5.1 Non-Automatic Speech Recognizer Based Prosodic Feature Extraction

Commonly used speech cues are the onset and offset of a syllable, and the start and end of voicing. For extracting syllable-based prosodic features, the speech signal should be segmented into syllables. All spoken utterances can be considered as sequences of syllables. The syllable-like regions in continuous speech can be identified using the locations of vowel onset points (VOPs). A VOP is the instant at which the onset of a vowel takes place in a syllable (Mary and Yegnanarayana, 2008).

There are various methods to detect pitch from the voiced speech signal (Babacan et al., 2013; Gerhard, 2003). Time domain methods include zero crossings and autocorrelation. Frequency domain methods include harmonic product spectrum, cepstrum, and filters. Some hybrid approaches use both time and frequency domain methods.
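As a concrete example of the time-domain methods just listed, the sketch below estimates F0 for a short voiced frame by autocorrelation; the frame length, sample rate, and pitch search range are illustrative choices, not values prescribed by the text.

```python
import numpy as np

def autocorrelation_pitch(frame, sr, fmin=60.0, fmax=400.0):
    """Estimate F0 (Hz) of a voiced frame by locating the strongest
    autocorrelation peak within a plausible pitch-period range."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]   # lags >= 0
    lag_min, lag_max = int(sr / fmax), int(sr / fmin)
    best_lag = lag_min + np.argmax(ac[lag_min:lag_max])
    return sr / best_lag

# Hypothetical test: a 30 ms synthetic "voiced" frame at 120 Hz, sampled at 16 kHz.
sr = 16000
t = np.arange(int(0.03 * sr)) / sr
frame = np.sin(2 * np.pi * 120 * t) + 0.3 * np.sin(2 * np.pi * 240 * t)
print(round(autocorrelation_pitch(frame, sr), 1))   # close to 120.0
```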

Intonation or pitch contour (F0) is the pitch pattern of an utterance; the direction of F0 changes by rising or falling with time. The pitch (F0) contour of a speech signal is associated with the VOP locations. Speaking rate is the number of syllables spoken per second; it indicates how fast the user spoke. Duration is the time taken to speak a syllable (Mary, 2012).

The syllables in a language are categorized as strong (stressed) or weak (unstressed) syllables. They are language specific. Stress defines the rhythm of speech. It indicates the relative prominence of a syllable. Stressed syllables are louder and longer in duration. They are characterized by higher energy, larger pitch (F0) movement, and longer duration. Stress is represented by the change in log energy corresponding to the voiced regions of a syllable, the F0 contour, and duration features.

Syllabic rhythm is represented by the distance between successive VOPs and the duration of voiced region within each syllable-like region. Loudness distinguishes sentences that are spoken with different volume. Usually the average and maximum volume levels from the start to the end of the sentence are measured. Volume is generally used to show emotions such as fear or anger.


URL:

https://www.sciencedirect.com/science/article/pii/S0169716116300463

Wizard of Oz

Saul Greenberg, ... Bill Buxton, in Sketching User Experiences: The Workbook, 2012

Example 1: The Listening Typewriter

In 1984, senior executives did not normally use computers. The issue was that they saw typing as something that secretaries did. To solve this problem, John Gould and his colleagues at IBM wanted to develop a 'listening typewriter', where executives would dictate to the computer, using speech to compose and edit letters, memos, and documents. While such speech recognition systems are now commonly available, at that time Gould didn't know if such systems would actually be useful or if they would be worth IBM's high development cost. He decided to prototype a listening typewriter using Wizard of Oz. He also wanted to look at two conditions: a system that could understand isolated words spoken one at a time (i.e., with pauses between words), and a continuous speech recognition system. Our example will illustrate the isolated words condition.

Note

The Wizard of Oz method is named after the well-known 1939 movie of the same name. The Wizard is an intimidating being who appears as a large disembodied face surrounded by smoke and flames and who speaks in a booming voice. However, Toto the dog exposes the Wizard as a fake when he pulls away a curtain to reveal a very ordinary man operating a console that controls the appearance and sound of the Wizard.

Around 1980, John Kelley adapted the 'Wizard of Oz' term to experimental design, where he acted as the 'man behind the curtain' to simulate a computer's response to people's natural language commands.

John Gould and others popularized the idea in 1984 through his listening typewriter study, as detailed in Example 1.

What the User Saw

The sketch below shows what the user – an executive – would see. The user would speak individual words into the microphone, where the spoken words (if known by the computer) would appear as text on the screen. If the computer did not know the word, a series of XXXXs would appear. The user would also have the opportunity to correct errors by special commands. For example, if the user said 'NUTS' the computer would erase the last word, while 'NUTS 5' would erase the last 5 words. Other special words let the user tell the computer to spell out unknown words (via 'SPELLMODE' and 'ENDSPELLMODE'), and add formatting (e.g., 'CAPIT' to capitalize the first letter of a word, and 'NEWPARAGRAPH' for a new paragraph).

What Was Actually Happening

Computers at that time could not do this kind of speech recognition reliably, so Gould and colleagues simulated it. They used a combination of a human Wizard who interpreted what the user said, and a computer that had simple rules for what to do with words typed to it by the Wizard. The sketch below shows how this was done. The Wizard typist, located in a different room, listened to each word the user said, and then typed that word into the computer. The computer would then check each word typed to see if it was in its limited dictionary. If it was, it would display it on the user's screen. If it wasn't, it would display XXXXs. For special command words, the typist would enter an abbreviation, which the computer would interpret as a command to trigger the desired effect on the user's screen.
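The computer side of this setup amounts to a small rule table. The sketch below is a loose reconstruction under assumptions made here (the dictionary, command abbreviations, and function names are invented, and Gould's actual system differed in detail); it only shows how little machine intelligence the simulation required.

```python
DICTIONARY = {"dear", "thank", "you", "for", "your", "letter"}   # invented subset
COMMANDS = {"nuts": "delete-last-word", "cap": "capitalize-next", "np": "new-paragraph"}

def handle_typed_token(token, transcript):
    """Apply the simulated computer's rules to one token typed by the Wizard."""
    token = token.lower()
    if token in COMMANDS:
        if COMMANDS[token] == "delete-last-word" and transcript:
            transcript.pop()                      # e.g., the 'NUTS' correction command
        return transcript
    transcript.append(token if token in DICTIONARY else "XXXX")   # unknown words masked
    return transcript

transcript = []
for typed in ["dear", "margaret", "nuts", "thank", "you"]:
    transcript = handle_typed_token(typed, transcript)
print(" ".join(transcript))   # "dear thank you" — 'margaret' was masked, then erased
```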

Tip

The Designer benefits by becoming the Wizard due to the "training" received by playing the role of Wizard. Being personally responsible for a person's discomfort and confusion motivates revisions! And simulating an incomplete design reveals its ill-defined aspects.

Keeping It Real

The chief danger in using a Wizard of Oz is that the Wizard can have powers of comprehension that no system could have. For example, a human Wizard can understand complex speech or gestural input that cannot be implemented reliably. The solution – as done in the listening typewriter – is to limit the Wizard's intelligence to things that can be implemented realistically (see the 1993 paper by Maulsby, Greenberg, and Mander for more details and another example of how this can be done).

1.

The Wizard's understanding of user input is based on a constrained input interaction model that explicitly lists the kinds of instructions that a system – if implemented – can understand and the feedback it can formulate. For example, even though the typist could understand and type continuous speech, the typist was told to listen to just a single spoken word, type in that spoken word, and then hit <enter>. The typist was also told to recognize and translate certain words as commands, which were then entered as an abbreviation. The computer further constrained input by recognizing only those words and commands in its limited vocabulary.

2.

The Wizard's response should be based on an 'algorithm' or 'rules' that limits the actions it takes to those that can be realistically implemented at a later time. For example, if the user said "type 10 exclamation marks", the Wizard's algorithm would be to just type in that phrase exactly, rather than '!!!!!'. Similarly, the computer substituted XXXXs for words it could not understand, and could respond to only a small set of editing commands.


URL:

https://www.sciencedirect.com/science/article/pii/B9780123819598500316

Robust Speech Recognition Under Noisy Ambient Conditions

Kuldip K. Paliwal, Kaisheng Yao, in Human-Centric Interfaces for Ambient Intelligence, 2010

6.1 Introduction

Ambient intelligence is the vision of a technology that will become invisibly embedded in our surroundings, enabled by simple and effortless interactions, context sensitive, and adaptive to users [1]. Automatic speech recognition is a core component that allows high-quality information access for ambient intelligence. However, it is a difficult problem and one with a long history that began with initial papers appearing in the 1950s [2, 3]. Thanks to the significant progress made in recent years in this area [4, 5], speech recognition technology, once confined to research laboratories, is now applied to some real-world applications, and a number of commercial speech recognition products (from Nuance, IBM, Microsoft, Nokia, etc.) are on the market. For example, with automatic voice mail transcription by speech recognition, a user can have a quick view of her voice mail without having to listen to it. Other applications include voice dialing on embedded speech recognition systems.

The main factors that have made speech recognition possible are advances in digital signal processing (DSP) and stochastic modeling algorithms. Signal processing techniques are important for extracting reliable acoustic features from the speech signal, and stochastic modeling algorithms are useful for representing speech utterances in the form of efficient models, such as hidden Markov models (HMMs), which simplify the speech recognition task. Other factors responsible for the commercial success of speech recognition technology include the availability of fast processors (in the form of DSP chips) and high-density memories at relatively low cost.

With the current state of the art in speech recognition technology, it is relatively easy to accomplish complex speech recognition tasks reasonably well in controlled laboratory environments. For example, it is now possible to achieve less than a 0.4% string error rate in a speaker-independent digit recognition task [6]. Even continuous speech from many speakers and from a vocabulary of 5000 words can be recognized with a word error rate below 4% [7]. This high level of performance is achievable only when the training and the test data match. When there is a mismatch between training and test data, performance degrades drastically.

Mismatch between training and test sets may occur because of changes in acoustic environments (background, channel mismatch, etc.), speakers, task domains, speaking styles, and the like [8]. Each of these sources of mismatch can cause severe degradation in recognition performance for ambient intelligence. For example, a continuous speech recognition system with a 5000-word vocabulary saw its word error rate rise from 15% in clean conditions to 69% in 10-dB to 20-dB signal-to-noise ratio (SNR) conditions [9, 10]. Similar degradations in recognition performance due to channel mismatch are observed. The recognition accuracy of the SPHINX speech recognition system on a speaker-independent alphanumeric task dropped from 85% to 20% correct when the close-talking Sennheiser microphone used in training was replaced by the omnidirectional Crown desktop microphone [11]. Similarly, when a digit recognition system is trained for a particular speaker, its accuracy can easily be 100%, but its performance degrades to as low as 50% when it is tested on a new speaker.

To understand the effect of mismatch between training and test conditions, we show in Figure 6.1 the performance of a speaker-dependent, isolated-word recognition system on speech corrupted by additive white noise. The recognition system uses a nine-word English e-set alphabet vocabulary where each word is represented by a single-mixture continuous Gaussian density HMM with five states. The figure shows recognition accuracy as a function of the SNR of the test speech under (1) mismatched conditions where the recognition system is trained on clean speech and tested on noisy speech, and (2) matched conditions where the training and the test speech data have the same SNR.

Figure 6.1. Effect of additive white noise on speech recognition performance under matched and mismatched conditions: training with clean speech (dotted line); training and testing with same-SNR speech (solid line).

It can be seen from Figure 6.1 that the additive noise causes a drastic degradation in recognition performance under the mismatched conditions; with the matched conditions, however, the degradation is moderate and graceful. It may be noted here that if the SNR becomes too low (such as −10 dB), the result is very poor recognition performance even when the system operates under matched noise conditions. This is because the signal is completely swamped by noise and no useful information can be extracted from it during training or in testing.
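For experiments like the one in Figure 6.1, test speech is typically corrupted by scaling white noise so that a target SNR is reached, using SNR = 10 log10(P_signal / P_noise). The sketch below is a generic helper written for this discussion, not the authors' code, and the sine wave merely stands in for a clean speech signal.

```python
import numpy as np

def add_white_noise(speech, snr_db, rng=np.random.default_rng(0)):
    """Return speech corrupted by additive white Gaussian noise at the given SNR (dB)."""
    noise = rng.standard_normal(len(speech))
    p_signal = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)
    # Scale the noise so that 10*log10(p_signal / p_scaled_noise) == snr_db.
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return speech + scale * noise

# Hypothetical clean "speech": here just a placeholder sine wave at 16 kHz.
sr = 16000
clean = np.sin(2 * np.pi * 200 * np.arange(sr) / sr)
for snr in (20, 10, 0, -10):
    noisy = add_white_noise(clean, snr)
    measured = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
    print(snr, round(measured, 1))   # measured SNR matches the requested value
```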

When a speech recognition system is deployed in a real-life situation for ambient intelligence, there is bound to be a mismatch between training and testing that causes severe deterioration in recognition performance. The aim of a robust speech recognition system is to remove the effect of mismatch and achieve performance that is as graceful as obtained under matched conditions.

Note that devices used for ambient intelligence are usually small, low power, low weight, and (very important) low cost. A successful speech recognition system therefore needs to consider factors of practical implementation and system usage. These challenges include but are not limited to dealing with large volumes of incoming recognition requests, prompt response, and hardware constraints such as low memory and fixed-point arithmetic on DSP chips.

In this chapter, we provide only a glimpse of robust speech recognition and describe briefly some of the popular techniques used for this purpose. (For more details, see [12–21].) We will focus here mainly on techniques to handle mismatches resulting from changes in acoustic environments (e.g., channel and noise distortions). Some of these are equally applicable to mismatches resulting from speaker variability. The chapter is organized as follows: Section 6.2 provides a brief overview of the automatic speech recognition process. Different sources of variability in speech signals are discussed in Section 6.3. Robust speech recognition techniques are briefly described in Section 6.4. Section 6.5 concludes the chapter.


URL:

https://www.sciencedirect.com/science/article/pii/B9780123747082000061


Source: https://www.sciencedirect.com/topics/computer-science/continuous-speech
