Sample Metadata Record

oai:www.ldc.upenn.edu:LDC2008T01


XML format

<olac:olac>
<dc:title>Hungarian-English Parallel Text, Version 1.0</dc:title>
<dc:contributor>Varga, Dániel</dc:contributor>
<dc:contributor>Németh, László</dc:contributor>
<dc:contributor>Halácsy, Péter</dc:contributor>
<dc:contributor>Kornai, András</dc:contributor>
<dc:contributor>et al.</dc:contributor>
<dc:date xsi:type="dcterms:W3CDTF">2008</dc:date>
<dcterms:issued xsi:type="dcterms:W3CDTF">2008-01-22</dcterms:issued>
<dc:description>*Introduction* Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus") is a sentence-aligned Hungarian-English parallel corpus consisting of approximately two million sentence pairs. The corpus contains additional language resources for the Hungarian text, including a monolingual corpus, morphological toolset and aligner. Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (BUTE) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics. Additional information about this release is available from the corpus website maintained by BUTE. *File formats, character encoding* This publication is issued on CD as a tarred zip file. Commonly available utilities such as Gnu Zip or Stuffit will readily extract this publication from its compressed form. Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible. Some .bi files were shuffled (sorted alphabetically). Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences. The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. 
The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded. hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively. *Samples* For an example of the data contained in this corpus, please examine this sample screen capture of bilingual text.</dc:description>
<dcterms:extent>Corpus size: 1992294 KB</dcterms:extent>
<dcterms:medium>Distribution: Web Download</dcterms:medium>
<dc:identifier>LDC2008T01</dc:identifier>
<dc:identifier>https://catalog.ldc.upenn.edu/LDC2008T01</dc:identifier>
<dc:identifier>ISBN: 1-58563-461-1</dc:identifier>
<dc:identifier>ISLRN: 694-868-944-045-4</dc:identifier>
<dc:identifier>DOI: 10.35111/khb2-hh45</dc:identifier>
<dcterms:bibliographicCitation>Varga, Dániel, et al. Hungarian-English Parallel Text, Version 1.0 LDC2008T01. Web Download. Philadelphia: Linguistic Data Consortium, 2008</dcterms:bibliographicCitation>
<dc:language xsi:type="olac:language" olac:code="hun">Hungarian</dc:language>
<dc:publisher>Linguistic Data Consortium</dc:publisher>
<dc:publisher xsi:type="dcterms:URI">https://www.ldc.upenn.edu</dc:publisher>
<dc:relation xsi:type="dcterms:URI">https://catalog.ldc.upenn.edu/docs/LDC2008T01</dc:relation>
<dcterms:accessRights>Licensing Instructions for Subscription &amp; Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining</dcterms:accessRights>
<dcterms:license>Hungarian-English Parallel Text, Version 1 Agreement: https://catalog.ldc.upenn.edu/license/hungarian-english-parallel-text-version-1.pdf</dcterms:license>
<dc:rightsHolder>Portions © 2005 Budapest University of Technology and Economics, © 2005 Hungarian Academy of Sciences Institute of Linguistics, © 2005 Diplomacy and Trade Magazine, © 1996, 2008 Trustees of the University of Pennsylvania</dc:rightsHolder>
<dc:type xsi:type="olac:linguistic-type" olac:code="primary_text"/>
<dc:type xsi:type="dcterms:DCMIType">Text</dc:type>
</olac:olac>

Display format

 Title  Hungarian-English Parallel Text, Version 1.0
 Contributor  Varga, Dániel
 Contributor  Németh, László
 Contributor  Halácsy, Péter
 Contributor  Kornai, András
 Contributor  et al.
 Date (W3CDTF)  2008
 Issued (W3CDTF)  2008-01-22
 Description  *Introduction* Hungarian-English Parallel Text, Version 1.0 (also known as the "Hunglish Corpus") is a sentence-aligned Hungarian-English parallel corpus consisting of approximately two million sentence pairs. The corpus contains additional language resources for the Hungarian text, including a monolingual corpus, morphological toolset and aligner. Hungarian-English Parallel Text, Version 1.0 is a joint work of the Media Research and Education Center at the Budapest University of Technology and Economics (BUTE) and the Corpus Linguistics Department at the Hungarian Academy of Sciences Institute of Linguistics. Additional information about this release is available from the corpus website maintained by BUTE. *File formats, character encoding* This publication is issued on CD as a tarred zip file. Commonly available utilities such as Gnu Zip or Stuffit will readily extract this publication from its compressed form. Sentence pair (.bi) files consist of tab-separated, matching sentence pairs. The .bi files do not contain segments where deletion or contraction occurred. They are also filtered based on quality, so the full reconstruction of the raw texts is impossible. Some .bi files were shuffled (sorted alphabetically). Alignment "ladder" (.lad) files preserve the whole of both input texts with ordering, even those segments that were not successfully aligned. In .lad files, every line is tab-separated into two columns. The first is a segment of the Hungarian text. The second is a (supposedly corresponding) segment of the English text. Such segments of the source or target text will generally consist of exactly one sentence on both sides, but can also consist of zero, or more than one, sentence. In the latter case, the special separating token " ~~~ " is placed between sentences. The encoding of the sentence pair and the alignment files is mixed: ISO Latin-2 on the Hungarian side, and ISO Latin-1 on the English side. 
The overwhelming majority of the texts use compatible subsets of these two encodings, so for viewing, the files can be considered ISO Latin-2 encoded. hu and en are the raw texts used, in ISO Latin-2 and ISO Latin-1 encoding respectively. *Samples* For an example of the data contained in this corpus, please examine this sample screen capture of bilingual text.
 Extent  Corpus size: 1992294 KB
 Medium  Distribution: Web Download
 Identifier  LDC2008T01
 Identifier  https://catalog.ldc.upenn.edu/LDC2008T01
 Identifier  ISBN: 1-58563-461-1
 Identifier  ISLRN: 694-868-944-045-4
 Identifier  DOI: 10.35111/khb2-hh45
 Bibliographic Citation  Varga, Dániel, et al. Hungarian-English Parallel Text, Version 1.0 LDC2008T01. Web Download. Philadelphia: Linguistic Data Consortium, 2008
 Language (ISO639-3)  Hungarian [hun], Hungarian
 Publisher  Linguistic Data Consortium
 Publisher (URI)  https://www.ldc.upenn.edu
 Relation (URI)  https://catalog.ldc.upenn.edu/docs/LDC2008T01
 Access Rights  Licensing Instructions for Subscription & Standard Members, and Non-Members: http://www.ldc.upenn.edu/language-resources/data/obtaining
 License  Hungarian-English Parallel Text, Version 1 Agreement: https://catalog.ldc.upenn.edu/license/hungarian-english-parallel-text-version-1.pdf
 Rights Holder  Portions © 2005 Budapest University of Technology and Economics, © 2005 Hungarian Academy of Sciences Institute of Linguistics, © 2005 Diplomacy and Trade Magazine, © 1996, 2008 Trustees of the University of Pennsylvania
 Type (OLAC)  Linguistic type: Primary text
 Type (DCMI)  Text
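
The alignment "ladder" (.lad) layout described in the Description field can be illustrated with a short Python sketch. It follows only what the description states: two tab-separated columns, Hungarian first in ISO Latin-2 and English second in ISO Latin-1, with " ~~~ " between sentences when a segment holds more than one. The function names are hypothetical, and the handling of a missing second column is an assumption rather than documented behavior of the corpus.

```python
SEPARATOR = "~~~"  # sentence separator token named in the description


def split_sentences(segment: str) -> list[str]:
    """Split one segment into sentences; an empty segment yields []."""
    parts = [part.strip() for part in segment.split(SEPARATOR)]
    return [part for part in parts if part]


def parse_lad_line(raw: bytes) -> tuple[list[str], list[str]]:
    """Decode one raw .lad line into (Hungarian sentences, English sentences).

    Each column is decoded with its own encoding, as the description
    says the files mix ISO Latin-2 and ISO Latin-1.
    """
    fields = raw.rstrip(b"\r\n").split(b"\t")
    hungarian = fields[0].decode("iso-8859-2")
    # Assumption: treat a missing second column as an empty English segment.
    english = fields[1].decode("iso-8859-1") if len(fields) > 1 else ""
    return split_sentences(hungarian), split_sentences(english)


def read_lad(path: str):
    """Yield parsed sentence lists for every line of a .lad file."""
    with open(path, "rb") as handle:
        for raw in handle:
            yield parse_lad_line(raw)
```

Reading the file in binary mode and decoding per column avoids picking a single encoding for the whole line. For casual viewing, the description notes that treating the file as ISO Latin-2 throughout is usually sufficient.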

Metadata quality analysis

OLAC metadata records are scored for metadata quality on a 10-point scale explained in OLAC Metadata Metrics. The score for the above record (along with comments on changes that could improve the score) is as follows:

Component + - Comments
Title   1   0 
Date   1   0 
Agent   1   0 
About   1   0 
Depth   1   0 
Content Language   1   0 
Subject Language   1   0 
OLAC Type   1   0 
DCMI Type   1   0 
Precision   0.67   0.33  For the full score, make use of at least one more encoding scheme in addition to the ones counted explicitly in other components of the score. For instance:
  • use olac:role on dc:creator or dc:contributor
  • use dcterms:IMT on dc:format
Quality score  9.67
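
As a quick check of the arithmetic, the sketch below assumes that the quality score is the plain sum of the "+" column. That assumption matches the values in the table above, but the actual scoring rules are the ones defined in OLAC Metadata Metrics. The dictionary simply restates the table.

```python
# "+" column of the table above, keyed by component name.
component_scores = {
    "Title": 1.0,
    "Date": 1.0,
    "Agent": 1.0,
    "About": 1.0,
    "Depth": 1.0,
    "Content Language": 1.0,
    "Subject Language": 1.0,
    "OLAC Type": 1.0,
    "DCMI Type": 1.0,
    "Precision": 0.67,
}

# Assumption: the overall score is the sum of the component scores.
quality_score = round(sum(component_scores.values()), 2)
print(f"Quality score: {quality_score:.2f}")
```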