OLAC Record
oai:scholarspace.manoa.hawaii.edu:10125/25368

Metadata
Title:A general format for time information to be the first-class data of general linguistics
Bibliographic Citation:Ohya, Kazushi; 2015-02-27; Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/25368.
Contributor (speaker):Ohya, Kazushi
Creator:Ohya, Kazushi
Date (W3CDTF):2015-03-12
Description:This presentation proposes a philosophy and a concrete data format for time information, in order to realize a new phenomenon that language documentation brings about in computational environments. The format should be simple and cogent, because linguists will use it as a common, fundamental format for recording sound data and relating it to encoded language data. To be simple, the format is based on a flat data model, and its actual description should be plain text. To be cogent, the format is based on mathematical foundations. In this proposal, the elements of a record are lined up in superset order: each element on the left is a superset of the element to its right. An element is either an ID (or an equivalent, such as a file name) or a pair of start and end timestamps. An example record is "original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav". This type of data format is needed for the following reasons. (1) As shown in [Author2012], a key strategy for sharing language resources is a data conversion service. In our experiments, data formats based on a multi-link-path model proposed by international organizations or research projects, such as ISO LAF/GrAF [Ide 2006, ISO24612] and TEI [Bauman 2008, ISO24610-1], have drawbacks in data size and data manipulation [Author2009]. Using these formats requires flexible data conversion programs or services [Author2011, Author2012]. To reduce the number of link paths, the part that defines data units in a standoff style can be moved out into an independent data file. The data format proposed here can be used for this kind of data. (2) There is a one-to-many relationship between actual sound and sound data, and a many-to-many relationship between sound data and encoded language data. Thus, replaying sound from sound data requires time information.
To realize sound data as first-class data in linguistics, linguists have to prepare, psychologically and practically, to change from consumers of sound into providers of it. The format proposed in this presentation is not a major result of an academic study, but it could serve as a case example that prompts linguists to consider the need for time information in their language documentation.
Brief References
[Author 2009]
[Author 2011]
[Author 2012]
Bauman, S. and L. Burnard (2008) TEI P5 Guidelines for Electronic Text Encoding and Interchange, TEI
Ide, N. and K. Suderman (2006) GrAF: A Graph-based Format for Linguistic Annotations, Proc. of the Linguistic Annotation Workshop
ISO 24612 (2012) Language resource management -- Linguistic annotation framework (LAF), ISO
ISO 24610-1 (2006) Language resource management -- Feature structures -- Part 1: Feature structure representation (FSR), ISO
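The record layout described in the abstract can be illustrated with a short Python sketch. It is not the author's implementation: the names `Span`, `to_seconds`, and `parse_record` are hypothetical, and the sketch assumes that fields are comma-separated, that timestamps have the shape HH:MM:SS with optional fractional seconds, and that any field not shaped like a timestamp is an ID such as a file name.

```python
import re
from dataclasses import dataclass
from typing import List, Union

# Assumed timestamp shape: HH:MM:SS with optional fractional seconds.
# The abstract shows only one example, so this pattern is a guess.
_TIMESTAMP = re.compile(r"^(\d+):([0-5]\d):([0-5]\d(?:\.\d+)?)$")


@dataclass(frozen=True)
class Span:
    """A pair of start and end times, in seconds (hypothetical helper type)."""
    start: float
    end: float


# An element is either an ID (e.g. a file name) or a time span.
Element = Union[str, Span]


def to_seconds(stamp: str) -> float:
    """Convert an HH:MM:SS[.fff] timestamp to seconds."""
    match = _TIMESTAMP.match(stamp)
    if match is None:
        raise ValueError(f"not a timestamp: {stamp!r}")
    hours, minutes, seconds = match.groups()
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_record(line: str) -> List[Element]:
    """Split one record into its elements, kept in left-to-right (superset) order."""
    fields = [field.strip() for field in line.strip().split(",")]
    elements: List[Element] = []
    i = 0
    while i < len(fields):
        if _TIMESTAMP.match(fields[i]):
            # A timestamp must be followed by its matching end timestamp.
            if i + 1 >= len(fields) or not _TIMESTAMP.match(fields[i + 1]):
                raise ValueError("start timestamp without an end timestamp")
            span = Span(to_seconds(fields[i]), to_seconds(fields[i + 1]))
            if span.end < span.start:
                raise ValueError("end timestamp precedes start timestamp")
            elements.append(span)
            i += 2
        else:
            # Anything else is treated as an ID such as a file name.
            elements.append(fields[i])
            i += 1
    return elements


record = "original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav"
print(parse_record(record))
# ['original_sound.wav', Span(start=13.0, end=63.25), 'part_of_sound1.wav']
```

Read left to right, the parsed elements say that original_sound.wav contains the interval from 13 s to 63.25 s, and that this interval in turn contains part_of_sound1.wav.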
Identifier (URI):http://hdl.handle.net/10125/25368
Rights:Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported
Table Of Contents:25368.pdf
25368.zip

OLAC Info

Archive:  Language Documentation and Conservation
Description:  http://www.language-archives.org/archive/ldc.scholarspace.manoa.hawaii.edu
GetRecord:  OAI-PMH request for OLAC format
GetRecord:  Pre-generated XML file

OAI Info

OaiIdentifier:  oai:scholarspace.manoa.hawaii.edu:10125/25368
DateStamp:  2024-08-11
GetRecord:  OAI-PMH request for simple DC format

Search Info

Citation: Ohya, Kazushi. 2015. Language Documentation and Conservation.


http://www.language-archives.org/item.php/oai:scholarspace.manoa.hawaii.edu:10125/25368
Up-to-date as of: Mon Nov 18 7:29:31 EST 2024