A large body of research in ASR has focused on the 1,000-hour LibriSpeech dataset of audiobook segments, while SD research focuses on telephone conversation transcripts from the Fisher, CALLHOME, and Switchboard corpora. Recent work has shown promising results in learning sequence transduction models that jointly perform ASR and SD in a two-speaker clinical setting by simply adding a speaker change token to the model's vocabulary. To further explore these types of end-to-end approaches, we expand the joint framework to encompass ASR and SD in an open-domain setting for extended multi-speaker conversations.

We introduce a benchmark dataset for this setting, consisting of 663 podcast episodes and transcripts collected from the weekly This American Life (TAL) radio program. TAL is unique in two ways: each episode is an hour-long conversation and contains an average of 18 unique speakers in three roles. In contrast to other ASR datasets, TAL transcripts contain proper punctuation and casing. Professional transcribers for TAL may elect to ignore stutters and irrelevant repetitions, performing minor grammatical fixes to the spoken words. Thus, transcription models for this setting must capture higher-level semantics of the utterance. TAL also contains a diverse set of speaker accents, varying rates of speech, and background music, making it an acoustically challenging dataset.

We propose two tasks for joint ASR and diarization, TAL aligned and TAL unaligned, to evaluate models under situations where utterance bounds are either provided or unknown, respectively. To benchmark performance in each setting, we measure transcription error via word error rate (WER) and introduce a new metric, multi-speaker word diarization error (MWDE), to evaluate word-level speaker alignment. MWDE generalizes the previously proposed two-speaker word diarization error rate to multiple speakers.

We compare TAL to several benchmark datasets for ASR and SD in Table 1, alongside the clinical dataset used in prior work. We note that RadioTalk also collected a large corpus of radio program transcripts, but did so using a noisy automated system with no corresponding gold labels and did not release the audio. Of these datasets, only LibriSpeech and TAL are free and openly accessible.
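As an illustration of the transcription metric, WER is the number of word-level substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. The sketch below computes it with a standard Levenshtein alignment over word tokens; it is an assumed minimal implementation for exposition, not the benchmark's evaluation code.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate via word-level edit distance.

    Assumes a non-empty, whitespace-tokenized reference.
    """
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i  # i deletions
    for j in range(len(hyp) + 1):
        dp[0][j] = j  # j insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,        # deletion
                dp[i][j - 1] + 1,        # insertion
                dp[i - 1][j - 1] + sub,  # match or substitution
            )
    return dp[len(ref)][len(hyp)] / len(ref)
```

For example, `wer("the cat sat on the mat", "the cat sat on mat")` is 1/6, since one deletion suffices to align the six reference words with the five hypothesis words.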