1. INTRODUCTION

The construction of a standard speech database is an unavoidable requirement for the progressive development of speech recognition and understanding systems. The past three decades have seen a steady growth of interest in corpus-based techniques for speech and natural language processing throughout the world. Corpus-based methods are found at the heart of many language and speech processing systems. Although Bangla is one of the most widely spoken languages, used by about 245 million people around the world, the history of corpus generation and corpus-based Bangla speech recognition is short and spans only a few years. Among the very few examples of Bangla speech corpora, probably the first was C-DAC’s Bangla Katha Bhandar, released in 2005 [1]. It was a product of the Center for Development of Advanced Computing (C-DAC) of India and is a collection of annotated speech corpora for Bangla.
Another step of similar work was done by the Center for Research on Bangla Language Processing of Bangladesh in 2010 [2]. In between these two, a research project financed by the MOSICT of Bangladesh was completed in June 2008. Under this project, large-scale speech corpora were recorded in the SIPL of Islamic University [3]. The distinction of the SIPL speech corpora from the other two is that they were designed especially for Bangla speech recognition.
As a continuation of the project, organizing, labeling, and similar processing of the results are still ongoing. This paper describes the design and development processes of a connected word speech corpus. After the basics of speech corpora, a short description of the BdNC01 text corpus is given to explain the selection of words for the speech database design. In the next subsections, speech recording, editing processes, and the final outcome are discussed. The paper concludes with the usability of the corpus.

2. SPEECH CORPUS FUNDAMENTALS

A corpus is a collection of pieces of language text in electronic form, selected according to external criteria to represent, as far as possible, a language or language variety as a source of data for linguistic research [4]. A speech corpus or spoken corpus is a database of speech audio files and text transcriptions in a format that can be used to create acoustic models, which can then be used with a speech recognition engine [5]. In a broad sense, speech corpora may be viewed as two types:

1. Read Speech – This includes book excerpts, broadcast news, lists of words, and sequences of numbers.
2. Spontaneous Speech – This includes dialogs between two or more people (including meetings), narratives such as a person telling a story, map tasks in which one person explains a route on a map to another, and appointment tasks in which two people try to find a common meeting time based on individual schedules.

Special kinds of speech corpora are non-native speech databases that contain speech with a foreign accent.

A speech corpus is the basis for both analyzing the characteristics of the speech signal and developing speech synthesis and recognition systems. Corpus content becomes more complicated, and corpus size larger, as computation power and speech technology develop. One method of selecting the speech content of a corpus is to derive the speech corpus from a text corpus. For example, WSJCAM0, a speech corpus of British English, was recorded at Cambridge University from the Wall Street Journal text corpus [6]. Before recording a speech corpus, careful selection of the vocabulary is important, since on average each out-of-vocabulary word usually causes between 1.5 and 2 errors [7].
The recognizer vocabulary is usually designed with the goal of maximizing lexical coverage for the expected input. A straightforward approach is to choose the N most frequent words in the training data, which means that the usefulness of the vocabulary is highly dependent upon the representativeness of the training data [8].

There are different parameters to categorize a speech recognition system. Influential parameters are speech type, speaker dependency, vocabulary size, etc. The importance of these parameters is based upon the typical design considerations of a recognition system, which may be closely related to a specific application or task [9]. In terms of speech types, speech recognition devices usually face recognition problems with isolated (discrete), connected, or continuous speech. Discrete speech requires a significant pause between words, possibly about 250 milliseconds. A single utterance may consist of a single word or a short string of isolated words.
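To make the role of the inter-word pause concrete, the following is a minimal Python sketch of how discrete speech could be split into utterances wherever the signal stays quiet for about 250 milliseconds. It is only an illustration and not a method taken from the corpus project: the function name `split_on_pauses`, the 10 ms frame length, and the energy threshold are all assumptions chosen for the example.

```python
def split_on_pauses(samples, sample_rate, min_pause_ms=250,
                    frame_ms=10, threshold=0.01):
    """Return (start, end) sample-index pairs of speech segments.

    A segment is closed once the mean frame energy has stayed below
    `threshold` for at least `min_pause_ms` milliseconds.
    All parameter defaults are illustrative assumptions.
    """
    frame_len = max(1, int(sample_rate * frame_ms / 1000))
    min_pause_frames = max(1, round(min_pause_ms / frame_ms))

    segments = []
    seg_start = None         # start index of the segment being built
    last_voiced_end = None   # end index of the last frame above threshold
    silent_run = 0           # consecutive quiet frames inside a segment

    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        energy = sum(x * x for x in frame) / len(frame)
        if energy >= threshold:
            if seg_start is None:
                seg_start = start
            last_voiced_end = start + len(frame)
            silent_run = 0
        elif seg_start is not None:
            silent_run += 1
            if silent_run >= min_pause_frames:
                # The pause is long enough: close the current segment.
                segments.append((seg_start, last_voiced_end))
                seg_start = None
                silent_run = 0

    if seg_start is not None:
        segments.append((seg_start, last_voiced_end))
    return segments


if __name__ == "__main__":
    # Synthetic signal at 1000 Hz: word, 300 ms pause, word, 100 ms gap, word.
    signal = [0.5] * 100 + [0.0] * 300 + [0.5] * 100 + [0.0] * 100 + [0.5] * 100
    print(split_on_pauses(signal, sample_rate=1000))
```

In the synthetic signal, the 300 ms pause separates two utterances, while the 100 ms gap is too short and leaves the last two bursts in one segment. This is the situation a connected speech recognizer has to handle without relying on pauses.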
In continuous speech recognition systems, fluent speech flows with a rhythm and the words run into each other, making recognition harder. In between these two, connected speech recognizers do not require an intermediate pause between inputs, but are able to detect word boundaries within a string of connected speech. They do, however, require that the user carefully enunciate each word, as in dictation. Although much of the relevant literature treats connected words and continuous speech as alternative terms, the vast diversity of applications makes it necessary to define connected words separately. In the speech recognition task, the difference in classification between “connected words” and “continuous speech” is somewhat technical. A connected word recognizer uses words as recognition units, which can be trained in an isolated word mode.
Such systems are particularly efficient in dictation and voice command recognition. Discrete, connected, and continuous speech recognition systems can be classified further as either speaker-dependent or speaker-independent systems. Speaker-dependent systems require that each speaker enter several samples of each word in the vocabulary to form the reference templates [10]. Another important consideration in designing a speech corpus is its vocabulary size.
The adjectives “small”, “medium” and “large” are applied to vocabulary sizes of the order of 100, 1000 and (over) 5000 words, respectively. A typical small vocabulary recognizer can recognize only the ten digits; a typical large vocabulary recognition system can recognize 20000 words [9]. In dictation and voice command recognition, a medium-size vocabulary may be considered sufficient for satisfactory performance.
This estimate is supported by the study of Gould, Conti, and Hovanyecz [10], which examined the feasibility of a limited-capability automatic dictation machine, simulated in isolated and connected speech modes using various vocabulary sizes. In their experiment, users composed and edited letters with the simulated voice recognizer, which had either a 1000-word vocabulary or an unlimited vocabulary. The 1000-word vocabulary was composed of the 1000 most frequently used English words.
An analysis afterwards indicated that roughly 75% of the words used in the letter writing task were available in the 1000-word vocabulary.

3. BdNC01 CORPUS AND DATABASE DESIGN

The BdNC01 corpus is a text corpus collected from the web editions of several influential Bangla newspapers during 2005-2011. BdNC01 contains a large amount of Bangla text, including more than 11 million word tokens. As a requirement of this work, a program was developed in the C language to parse and sort the text in the BdNC01 corpus; the result was a list of words with their frequency of occurrence in the text. The objective of this processing was to select a list of 1000 or more high-frequency words that is a good representative of the language, in order to construct a significant connected speech database. A part of the list is shown in Table-1, and the 1000 most frequent words were selected to find some practical Bangla sentences.
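The frequency-list step described above can be sketched as follows. The original program was written in C and is not reproduced here; this Python version is only an illustration of the general idea. The function names, the punctuation set, and the romanized toy sentence are assumptions made for the example, and real Bangla text would need more careful tokenization.

```python
from collections import Counter


def word_frequencies(text):
    """Count whitespace-separated tokens after stripping edge punctuation."""
    # U+0964 is the danda used as a sentence terminator in Bangla text.
    punctuation = ".,;:!?\"'()[]{}\u0964"
    tokens = (tok.strip(punctuation) for tok in text.split())
    return Counter(tok for tok in tokens if tok)


def top_n_with_coverage(freqs, n):
    """Return the n most frequent words and the share of tokens they cover."""
    top = freqs.most_common(n)
    total = sum(freqs.values())
    covered = sum(count for _, count in top)
    coverage = covered / total if total else 0.0
    return [word for word, _ in top], coverage


if __name__ == "__main__":
    # Toy romanized sample, not taken from BdNC01.
    sample = "ami bhat khai. tumi bhat khao. ami jai."
    freqs = word_frequencies(sample)
    vocab, coverage = top_n_with_coverage(freqs, 2)
    print(vocab, coverage)
```

The coverage value is the fraction of running words in the text that fall inside the selected vocabulary, which is the same kind of measure as the roughly 75% figure reported for the 1000-word English vocabulary.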
From three randomly selected issues of daily newspapers, 52 sentences were selected such that they include the high-frequency words above. The list of sentences was accepted for a small-to-medium vocabulary speech database and includes 252 different words in 343 places. The special characteristic of this list is that some words occur in multiple places with different contexts.
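The figures quoted for the sentence list (different words versus places where words occur) correspond to word types and word tokens. The short Python sketch below shows one way such statistics could be computed for a candidate sentence list. It is an illustration under assumptions, not the procedure used in the project: the function name `sentence_list_stats`, the whitespace tokenization, and the toy romanized sentences are hypothetical.

```python
from collections import Counter


def sentence_list_stats(sentences, frequent_words):
    """Summarize a sentence list against a set of high-frequency words."""
    tokenized = [s.split() for s in sentences]
    counts = Counter(tok for sent in tokenized for tok in sent)

    # For each word, count the number of sentences that contain it.
    sentence_counts = Counter(tok for sent in tokenized for tok in set(sent))

    return {
        "tokens": sum(counts.values()),   # places where words occur
        "types": len(counts),             # different words
        "frequent_types": sum(1 for w in counts if w in frequent_words),
        "multi_sentence_words": sorted(
            w for w, c in sentence_counts.items() if c > 1
        ),
    }


if __name__ == "__main__":
    # Toy romanized sentences and word list, not taken from the corpus.
    sentences = ["ami bhat khai", "tumi bhat khao", "ami jai"]
    frequent = {"ami", "bhat", "tumi"}
    print(sentence_list_stats(sentences, frequent))
```

Words listed under `multi_sentence_words` are those that appear in more than one sentence, that is, in more than one context, which is the property highlighted for the selected sentence list.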