OLAC Record oai:www.ldc.upenn.edu:LDC2023S01 |
Metadata | ||
Title: | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | |
Access Rights: | Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining | |
Bibliographic Citation: | Delgado, Dana, et al. AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts LDC2023S01. Web Download. Philadelphia: Linguistic Data Consortium, 2023 | |
Contributor: | Delgado, Dana | |
Walker, Kevin | ||
Graff, David | ||
Strassel, Stephanie | ||
Date (W3CDTF): | 2023 | |
Date Issued (W3CDTF): | 2023-01-17 | |
Description: | *Introduction* AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and comprises approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news audio (BN) with 1.2 million words of corresponding orthographic transcripts. The broadcast recordings and transcripts were produced to support the DARPA AIDA (Active Interpretation of Disparate Alternatives) program, which aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages. The telephone speech audio recordings were collected to support the NIST 2011 Language Recognition Evaluation, which focused on pair discrimination for 24 languages/dialects. These recordings are also contained in Multi-Language Conversational Telephone Speech 2011 – Slavic Group LDC2016S11. The goal of NIST’s LRE series is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. *Data* The CTS audio data was generated from telephone calls by native Ukrainian speakers to acquaintances in their social network. It was collected using LDC's telephone infrastructure, comprising three computer telephony systems. Human auditors labeled calls for callee gender, dialect type and noise. All CTS audio files were originally collected as 2-channel u-law and were converted to 8KHz 16-bit pcm and flac compressed for release. The BN data was taken from 87 news recordings broadcast by various Ukrainian sources.
All BN audio files were originally collected as mp3 via web download or as live streaming broadcast captures and were downsampled to either 16KHz or 22KHz 16-bit pcm and flac compressed for release. Native Ukrainian speakers manually segmented the data into sentence-level units as part of the transcription process. All transcripts are delivered as tab-delimited *.tsv files that include metadata and statistics. *Sponsorship* This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract Nos. HR0011-15-C-0123 and FA8750-18-C-0013. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA. *Samples* Please view these samples: * Audio Sample (FLAC) * Transcript Sample (TSV) *Updates* None at this time. | |
Extent: | Corpus size: 10246124 KB | |
Format: | Sampling Rate: CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm | |
Sampling Format: CTS 8KHz 16-bit pcm, BN 16KHz or 22KHz 16-bit pcm | ||
Identifier: | LDC2023S01 | |
https://catalog.ldc.upenn.edu/LDC2023S01 | ||
ISLRN: 699-485-644-732-3 | ||
DOI: 10.35111/qge4-4f15 | ||
Language: | Ukrainian | |
Language (ISO639): | ukr | |
License: | LDC User Agreement for Non-Members: https://catalog.ldc.upenn.edu/license/ldc-non-members-agreement.pdf | |
Medium: | Distribution: Web Download | |
Publisher: | Linguistic Data Consortium | |
Publisher (URI): | https://www.ldc.upenn.edu | |
Relation (URI): | https://catalog.ldc.upenn.edu/docs/LDC2023S01 | |
Rights Holder: | Portions © 2017 Crimean Radio and Television Company, © 2017-2018 Hromadske Radio, © 2017-2018 LiveOnlineRadio.Net, © 2017-2018 Radio of Ukraine, © 2017-2018 Radio Vesti, © 2017-2018 RFE/RL, Inc., © 2016, 2018, 2022, 2023 Trustees of the University of Pennsylvania | |
Type (DCMI): | Sound | |
Text | ||
Type (OLAC): | primary_text | |
OLAC Info |
||
Archive: | The LDC Corpus Catalog | |
Description: | http://www.language-archives.org/archive/www.ldc.upenn.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:www.ldc.upenn.edu:LDC2023S01 | |
DateStamp: | 2024-01-01 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Delgado, Dana; Walker, Kevin; Graff, David; Strassel, Stephanie. 2023. Linguistic Data Consortium. | |
Terms: | area_Europe country_UA dcmi_Sound dcmi_Text iso639_ukr olac_primary_text |
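
The GetRecord entries above refer to OAI-PMH requests for this record in OLAC and simple DC formats. As a rough sketch of how such a request URL is formed under the OAI-PMH protocol (a `verb`, an `identifier`, and a `metadataPrefix` query parameter), the Python below builds both URLs. The base URL is a placeholder and an assumption, not the archive's actual endpoint, and the helper name `get_record_url` is hypothetical; `olac` and `oai_dc` are the metadata prefixes conventionally used for OLAC and simple Dublin Core.

```python
from urllib.parse import urlencode

# Assumption: placeholder endpoint. Substitute the real OAI-PMH base URL
# of the repository that serves this record.
BASE_URL = "https://example.org/oai"

# Identifier taken from the "OaiIdentifier" field of this record.
OAI_IDENTIFIER = "oai:www.ldc.upenn.edu:LDC2023S01"


def get_record_url(identifier: str, metadata_prefix: str,
                   base_url: str = BASE_URL) -> str:
    """Build an OAI-PMH GetRecord request URL (hypothetical helper)."""
    query = urlencode({
        "verb": "GetRecord",
        "identifier": identifier,
        "metadataPrefix": metadata_prefix,
    })
    return f"{base_url}?{query}"


# OLAC-format request and simple Dublin Core request for the same record.
olac_url = get_record_url(OAI_IDENTIFIER, "olac")
dc_url = get_record_url(OAI_IDENTIFIER, "oai_dc")

print(olac_url)
print(dc_url)
```

Note that `urlencode` percent-encodes the colons in the identifier, which an OAI-PMH server decodes back to the original value.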