A Survey of the State of the Art in Digital Language Documentation and Description

Steven Bird and Gary Simons
Draft: 5 December 2000

About this document.
This document has been prepared in conjunction with the workshop on Web-Based Language Documentation and Description, held in Philadelphia on 12-15 December 2000. It is a follow-up to the requirements document, helping to assess the extent to which the requirements are met by the present state of the art.

2004-03-30 NOTE: This document is no longer maintained, and contains many broken hyperlinks.

Whether one is collecting new language data, searching a corpus for an instance of some linguistic phenomenon, looking for dictionaries and texts from a particular language family, converting data to work with a favorite tool, cataloging language resources, or any of a host of similar tasks, one is immediately confronted with a series of questions:

What data is available?
What tools are available?
How adequate are these resources?
Who is creating and using these resources?
Where can I go for advice?

A more extensive list of such questions (with answers) is available at the LTG Helpdesk FAQ.

1. What data is available?

In recent months, we have conducted a survey of language archives [http://www.ldc.upenn.edu/exploration/survey.html]. Respondents were asked to answer the following questions:

1. Name and Location

1.1 Please provide the archive name, URL, host institution, country, contact person and email address.

2. Catalog

2.1 If the archive has a catalog in a standardized format, what fields does it contain? If not, what contextual information about the resources is collected? What other information would you like to collect if you could?

2.2 If the electronic catalog conforms to some standard, please tell us the name of the standard.

2.3 To what extent have the archived materials been cataloged electronically?

2.4 If there is an online public access catalog, please give its URL.

3. Holdings

3.1 What geographical regions and languages are covered?

3.2 Please give impressionistic estimates of the archive holdings for each of the data types: Texts; Wordlists, Vocabularies, Lexicons, Dictionaries; Field Notes, Correspondence, Misc files; Descriptions (Grammars, Phonologies, etc); Audio Recordings; Video Recordings.

3.3 Please list any other data types which are not included above, or any other comments on the archive holdings.

3.4 What proportion of the holdings are unique to the archive and not available elsewhere?

4. Electronic Publication

4.1 To what extent are the archive holdings published electronically, where "published" means that there is a well-defined procedure such that anyone at all can get a standard copy of the data, either on digital media or over the internet?

4.2 To what extent are the archive holdings accessible over the web?

4.3 Is permission required before materials can be accessed?

4.4 Is there any fee for materials?

4.5 How are author and/or editor defined for the electronic publications? Is there a bibliographical citation method?

4.6 Do the electronic publications have ISBN numbers?

4.7 What plans are there to expand the electronic publication of archive holdings?

5. General Issues

5.1 Who is the legal owner of archived materials? The original collector or his/her estate? The language community? The archive or its host institution? Some combination of these?

5.2 Beyond legal ownership, are there any asserted or perceived moral rights concerning archived materials? Do the holders of the archive see the original speakers or their representatives as controlling publication?

5.3 In cases where no electronic publication is planned, why is this so? (e.g. funding, licensing, technical know-how, lack of interest).

5.4 Is any of the data in a proprietary format (e.g. MS Word)? If so, are there plans to transfer it to an open standard (e.g., XML)?

6. Do you have any other comments about digital archives of language material, or on this survey?

Responses were received from some twenty archives, and the completed survey forms are all available online [http://www.ldc.upenn.edu/exploration/survey/].

The full set of archives which have digital catalogs and holdings, or concrete plans for these, is listed below, with URLs and contact names.

AILLA: Archive of Indigenous Languages of Latin America
[http://uts.cc.utexas.edu/~ailla/introeng.html]
Joel Sherzer, Anthony Woodbury, University of Texas, Austin
ALMA: African Language Material Archive
[http://polyglot.lss.wisc.edu/afrst/wara.html]
Leigh Swigart, West African Research Association
ANLC: Alaska Native Language Center Archives
[http://www.uaf.edu/anlc]
Gary Holton, University of Alaska
APS: American Philosophical Society American Indian Manuscript Collections
[http://www.amphilsoc.org/library/guides/indians/]
Robert Cox, American Philosophical Society
ASEDA: Aboriginal Studies Electronic Data Archive
[http://coombs.anu.edu.au/SpecialProj/ASEDA/ASEDA.html]
Patrick McConvell, Australian Institute of Aboriginal and Torres Strait Islander Studies
BAS: Bavarian Archive of Speech Signals
[http://www.phonetik.uni-muenchen.de/Bas/BasHomeeng.html]
Florian Schiel, University of Munich
CDEL: Center for the Documentation of Endangered Languages
[http://php.indiana.edu/~aisri/lab/home.html]
Douglas Parks, Wally Hooper, Indiana University
CHILDES: Child Language Data Exchange System
[http://childes.psy.cmu.edu]
Brian MacWhinney, Carnegie Mellon University
Corpus Documentale Latinum Portugaliae
Antonio Emiliano, University of Lisbon
CNNC: Charlotte Narrative and Conversation Collection
[http://www.uncc.edu/english/cnnc/]
Boyd Davis, Pat Ryckman, University of North Carolina, Charlotte
Creolist Archives
[http://www.ling.su.se/Creole/Text_Collection.shtml]
Mikael Parkvall, University of Stockholm
CDLI: Cuneiform Digital Library Initiative
[http://cdli.ucla.edu/]
Robert Englund, UCLA
ELRA: European Language Resources Association
[http://www.icp.inpg.fr/ELRA/catalog.html]
Khalid Choukri, Paris
LACITO Linguistic Data Archive
[http://195.83.92.32/index.html.en]
Boyd Michailovsky, CNRS, Paris
Linguistic Data Consortium
[http://www.ldc.upenn.edu/Catalog/]
Mark Liberman, University of Pennsylvania
LPCA: Language and Popular Culture in Africa Text Archives
[http://www.pscw.uva.nl/lpca/textarchives/toc.html]
Vincent De Rooij, University of Amsterdam
Max Planck Institute Language Archive and DOBES Archive
Peter Wittenburg, Max Planck Institute
NAA: National Anthropological Archives
[http://www.nmnh.si.edu/naa/]
Robert Leopold, Smithsonian Institution
OTA: Oxford Text Archive
[http://ota.ahds.ac.uk/ota/]
Michael Popham, Oxford University
SIL Language and Culture Archive
Joan Spanne, Summer Institute of Linguistics
SIL-MEX: SIL Mexico Archive
[http://www.sil.org/mexico/]
Albert Bickford, Summer Institute of Linguistics
Survey of California and Other Indian Languages
[http://linguistics.berkeley.edu/Survey/]
Leanne Hinton, University of California, Berkeley
UHLCS: University of Helsinki Language Corpus Server
[http://www.ling.helsinki.fi/uhlcs/]
Pirkko Suihkonen, University of Helsinki

Most of these archives have a partial digital catalog, and about 25% have a complete digital catalog. A couple of them use MARC or TEI. The following is a list of catalog fields which are used or proposed by the above archives.

language id (for the resource and for its subject, ethnologue code, RFC 1766, ISO 639-2, alternative language names, language group)
title of the resource, transliterated title
resource type (e.g. lexicon, text, signal, ...)
modality (e.g. text, audio, video, physiological, ...)
file format, sample rate, number of tracks, size
media type, dimensions (of book) or number (of CD-ROMs)
recording details (e.g. microphone type)
genre (e.g. narrative, instructional, greeting, ...)
thematic topics
register (e.g. formal, informal, honorific, collaborative, ...)
event type (e.g. interview, meeting, ceremony, announcement, ...)
participant description (e.g. name, age, gender, education, ...)
interviewer, recorder, transcriber, ...
speech style (e.g. whisper, mutter, talk, sing, falsetto, ...)
transcription type (e.g. phonetic, orthographic, gesture, musical, ...)
translation type (e.g. morpheme-level, word-level, sentence-level, ...)
date, location (e.g. of creation, encoding, publication)
access rights, use restrictions, copyrights, licenses, price
editor, series name, series number, publisher
catalog number (local, ISBN, ...)
project for which the resource was created
technological applications of the resource (e.g. machine translation, ...)
URL for an online version of the resource, or for documentation
provenance of the resource (e.g. geographical origin)
historical period covered by the resource
thesis level, degree granting institution
software version, platform
contact person/institution, address

Archives use some subset of these elements, in a variety of formats. For certain elements an archive has evidently adopted a controlled vocabulary. At present there are no widely used standards for the storage format or for the controlled vocabularies, so catalog information from different archives is not directly comparable.
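
To illustrate what comparable catalog information might involve, here is a minimal Python sketch that maps records from two archives onto shared field names and a shared resource-type vocabulary. Everything in it is an assumption made for illustration: the archive labels, the local field names, the vocabulary mappings and the sample records are invented, and do not describe the practice of any archive listed above.

```python
from dataclasses import dataclass
from typing import Dict, Optional

# Hypothetical mapping from each archive's local field names to common ones.
FIELD_MAP: Dict[str, Dict[str, str]] = {
    "archive_a": {"lang": "language_id", "name": "title", "kind": "resource_type"},
    "archive_b": {"LanguageCode": "language_id", "Title": "title", "Type": "resource_type"},
}

# Hypothetical controlled vocabulary: local terms on the left,
# shared resource-type terms on the right.
TYPE_VOCAB: Dict[str, str] = {
    "dictionary": "lexicon",
    "wordlist": "lexicon",
    "lexicon": "lexicon",
    "narrative": "text",
    "text": "text",
    "recording": "signal",
    "signal": "signal",
}


@dataclass
class CatalogRecord:
    language_id: str
    title: str
    resource_type: Optional[str]  # None when the local term is not in TYPE_VOCAB


def normalize(archive: str, raw: Dict[str, str]) -> CatalogRecord:
    """Rename local fields to common ones and map the resource type."""
    mapping = FIELD_MAP[archive]
    common = {mapping[key]: value for key, value in raw.items() if key in mapping}
    local_type = common.get("resource_type", "").strip().lower()
    return CatalogRecord(
        language_id=common.get("language_id", "").strip(),
        title=common.get("title", "").strip(),
        resource_type=TYPE_VOCAB.get(local_type),
    )


# Invented sample records describing similar resources in two local formats.
record_a = normalize("archive_a", {"lang": "haw", "name": "Example wordlist", "kind": "Wordlist"})
record_b = normalize("archive_b", {"LanguageCode": "haw", "Title": "Example dictionary", "Type": "dictionary"})
```

After normalization the two records share the same field names and the same resource-type term, so they can be compared or searched together. A local term that is missing from the vocabulary yields None instead of a guess, which leaves such records visible for manual review.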

About half of these archives have some materials in digital form, and about 20% are completely digital. Digital materials are stored in a variety of formats, including: HTML, SGML, XML, PDF, TEI Lite, Filemaker, MS Access, MS Word, and project-internal formats.

To find out what is available, it is necessary to consult the catalogs of each archive independently, typically using different interfaces and vocabularies for each one.

There are also links pages that point to available data, e.g. Corpus Linguistics.

2. What tools are available?

Available tools are listed on several links pages, including the following:

For LinguistList and the CMU AI Repository, the categorization of the tools is by application domain (e.g. text analysis, morphology, fonts, ...). For the Linguistic Annotation and Linguistic Exploration pages, there is a key for the platform. In the other cases there is no categorization.

The ACL/DFKI Natural Language Software Registry

The Natural Language Software Registry is a key community resource initiated by the ACL and organized by DFKI in Saarbrücken.

The registry uses a taxonomy based on: State of the art in Language Technology

http://registry.dfki.de/ Hans Uszkoreit, Thierry Declerck

Categories:

annotation tools
evaluation tools
resources: grammars, lexicons, multimodal corpora, spoken language corpora, terminology, written language corpora
multimodality
NLP development aid: tools, formalisms, machine learning methods, architectures, theories
spoken language: signal analysis, signal editing, signal process, speaker recognition, speech analysis, speech editing, speech processing, speech production, speech recognition, speech synthesis, spoken dialog systems, spoken language generation, spoken language translation, spoken language understanding, text-to-speech synthesis, voice analysis, voice processing
written language: alignment tools, corpus analysis, deep generation, deep syntactic analysis, document image analysis, grammar and style checkers, handling controlled languages, information extraction, information retrieval, language guesser, lemmatizer, lexicon management, morphological generation, morphological analysis, optical character recognition, part-of-speech tagging, partial parsing, processing mark-up languages, segmenter, semantic and pragmatic analysis, shallow generation, shallow parsing, speech checkers, stemmer, summarization, terminology extraction, terminology management, text classification, tokenization, translation memory, written dialog systems, written language translation, written language understanding

Search form, permitting search on the following fields: name, abstract, description, license (free, to negotiate, commercial), kind of license (academic, multiple user, commercial), main section, operating system, supported language
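
A fielded search of this kind can be sketched in Python as follows. This is an illustration only, not the registry's actual implementation: the entries are invented, and the matching rule (case-insensitive substring match, with all criteria required) is an assumption about how such a form might behave.

```python
from typing import Dict, List

# Invented entries using some of the field names from the search form.
ENTRIES: List[Dict[str, str]] = [
    {
        "name": "ExampleTagger",
        "license": "free",
        "main section": "written language",
        "operating system": "Unix",
        "supported language": "English",
    },
    {
        "name": "ExampleAligner",
        "license": "commercial",
        "main section": "written language",
        "operating system": "Windows",
        "supported language": "French",
    },
    {
        "name": "ExampleRecognizer",
        "license": "free",
        "main section": "spoken language",
        "operating system": "Unix",
        "supported language": "German",
    },
]


def search(entries: List[Dict[str, str]], criteria: Dict[str, str]) -> List[Dict[str, str]]:
    """Return the entries in which every criterion matches its field.

    A criterion matches when its value occurs, ignoring case, as a
    substring of the entry's value for that field.
    """
    results = []
    for entry in entries:
        if all(
            wanted.lower() in entry.get(field, "").lower()
            for field, wanted in criteria.items()
        ):
            results.append(entry)
    return results


free_written = search(ENTRIES, {"license": "free", "main section": "written"})
```

Here free_written holds only the ExampleTagger entry, since it is the only invented entry that is both free and filed under written language.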

3. How adequate are these resources? (draft)

learn by trial and error

no systematic evaluation available

just tools - no support for interoperability, standard formats, etc

best practice recommendations exist (e.g. TEI, CES) - what is the extent of their adoption?

4. Who is creating and using these resources?

The community is arranged into three main groups. The first group is engaged in the core activity of generating and using language resources. The second group provides the technical foundation for this core activity, while the third group constitutes the administrative umbrella.

1. CREATORS AND USERS OF LANGUAGE RESOURCES - THE CORE ACTIVITY

Speakers
using and learning languages; providing primary materials and commentary; promoting language use and teaching.

Descriptivists
linguists, sociolinguists, and linguistic anthropologists documenting language structure and use.

Educators
teaching specific languages, and the linguistic structure of specific languages.

Theorists
developing new models of the human language faculty.

Technologists
developing new human language technologies.

2. IMMEDIATE INFRASTRUCTURE - THE TECHNICAL FOUNDATION

Archivists
digital archivists and librarians providing storage and access for language resources.

Developers
computer scientists developing models, formats, architectures and tools for creating and searching digital language data.

Publishers
disseminating language resources in paper and digital form.

3. SPONSORS AND PROMOTERS - THE UMBRELLA

Professional Associations
promoting language resources, and the adoption of best-practices for digital archives.

Government Funding Agencies
establishing funding priorities, and evaluating and enabling language resources.

Non-Governmental Organizations
promoting and funding language resources.

Table 1: The Language Resources Community

Some archives catalog/distribute the resources of others.

5. Where can I go for advice?

Creators, users and archivers of language resources are often faced with a bewildering array of technological options, with no obvious source for competent advice. The most popular way to obtain advice is through the large collection of electronic mailing lists. On many of the following lists there is significant exchange of information concerning best practices.

anthro-list [email protected]

archives-list [email protected]

corpora-list [email protected]

diglib-list [email protected]

elsnet-list [email protected]

electronic-records-list [email protected]

empiricists-list [email protected]

endangered-languages-list [email protected]

exploration-list [email protected]

language-culture-list [email protected]

linganth-list [email protected]

linguist-list [email protected]

nl-kr-list [email protected]

salt-request [email protected]

saltmil [email protected]

Another source of advice is the LTG Helpdesk. This site represents a vision for a repository / clearing house for best practice recommendations.

People needing advice typically resort to posting a query on one or more lists, sorting through the responses, and possibly posting a summary of responses back to the lists. However, it is often difficult to decide on a good course of action when the primary information is an uncoordinated set of suggestions originating from strangers on a mailing list. In a period of rapidly evolving technology, a wrong choice can lead to a dead end, leaving painstakingly collected data unusable. Numerous experiences of this community attest to this reality. So how can we make wise use of the new technological opportunities before us?
