OLAC Record oai:scholarspace.manoa.hawaii.edu:10125/25368 |
Metadata | ||
Title: | A general format for time information to be the first-class data of general linguistics | |
Bibliographic Citation: | Ohya, Kazushi; 2015-02-27; Kaipuleohone University of Hawai'i Digital Language Archive; http://hdl.handle.net/10125/25368. | |
Contributor (speaker): | Ohya, Kazushi | |
Creator: | Ohya, Kazushi | |
Date (W3CDTF): | 2015-03-12 | |
Description: | This presentation proposes a philosophy and a concrete data format for time information, in order to realize the new phenomenon that language documentation brings about in computational environments. The format should be simple and cogent, because linguists will use it as a common, fundamental format for recording sound data and relating it to encoded language data. To be simple, the format is based on a flat data model and is best written as plain text. To be cogent, it rests on mathematical foundations. In this proposal, the elements of a record are arranged in superset order: each element on the left is a superset of the element to its right. An element is either an ID (or an equivalent, such as a file name) or a pair of start and end timestamps. An example record is "original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav". This type of data format is needed for two reasons. (1) As shown in [Author2012], a key strategy for sharing language resources is a data conversion service. In our experiments, data formats based on a multi-link-path model proposed by international organizations or research projects, such as ISO LAF/GrAF [Ide 2006, ISO24612] and TEI [Bauman 2008, ISO24610-1], have drawbacks in data size and data manipulation [Author2009]. Using these formats requires flexible data conversion programs or services [Author2011, Author2012]. To reduce the number of link paths, the part that defines data units in a standoff style can be moved out into an independent data file. The data format proposed here can be used for this kind of data. (2) There is a one-to-many relationship between actual sound and sound data, and a many-to-many relationship between sound data and encoded language data. Thus, replaying sound from sound data requires time information.
To make sound data first-class data in linguistics, linguists must prepare, both psychologically and practically, to change from consumers of sound into providers of it. The format proposed in this presentation is not a major result of an academic study, but it could serve as a case example that prompts linguists to consider the need for time information in their language documentation. Brief References: [Author 2009]; [Author 2011]; [Author 2012]; Bauman, S. and L. Burnard (2008) TEI P5 Guidelines for Electronic Text Encoding and Interchange, TEI; Ide, N. and K. Suderman (2006) GrAF: A Graph-based Format for Linguistic Annotations, Proc. of the Linguistic Annotation Workshop; ISO 24612 (2012) Language resource management -- Linguistic annotation framework (LAF), ISO; ISO 24610-1 (2006) Language resource management -- Feature structures -- Part 1: Feature structure representation (FSR), ISO | |
Identifier (URI): | http://hdl.handle.net/10125/25368 | |
Rights: | Creative Commons Attribution-Noncommercial-Share Alike 3.0 Unported | |
Table Of Contents: | 25368.pdf | |
25368.zip | ||
OLAC Info |
||
Archive: | Language Documentation and Conservation | |
Description: | http://www.language-archives.org/archive/ldc.scholarspace.manoa.hawaii.edu | |
GetRecord: | OAI-PMH request for OLAC format | |
GetRecord: | Pre-generated XML file | |
OAI Info |
||
OaiIdentifier: | oai:scholarspace.manoa.hawaii.edu:10125/25368 | |
DateStamp: | 2024-08-11 | |
GetRecord: | OAI-PMH request for simple DC format | |
Search Info | ||
Citation: | Ohya, Kazushi. 2015. Language Documentation and Conservation. |
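The record layout outlined in the Description field can be illustrated with a short sketch. This is not the author's implementation: the names `TimeSpan`, `parse_timestamp`, `looks_like_timestamp`, and `parse_record` are hypothetical, and the code assumes that fields are comma-separated without quoting and that timestamps take the form HH:MM:SS with optional fractional seconds, as in the single example record the abstract gives.

```python
from dataclasses import dataclass
from typing import List, Union


@dataclass(frozen=True)
class TimeSpan:
    """A pair of start and end timestamps, in seconds (hypothetical type)."""
    start: float
    end: float


def looks_like_timestamp(field: str) -> bool:
    """Return True if the field has the assumed HH:MM:SS[.fraction] shape."""
    parts = field.split(":")
    if len(parts) != 3:
        return False
    try:
        int(parts[0])
        int(parts[1])
        float(parts[2])
    except ValueError:
        return False
    return True


def parse_timestamp(field: str) -> float:
    """Convert an HH:MM:SS[.fraction] string to seconds."""
    hours, minutes, seconds = field.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def parse_record(line: str) -> List[Union[str, TimeSpan]]:
    """Split one record into IDs and timestamp pairs, kept in left-to-right
    (superset) order. Two adjacent timestamp fields form one TimeSpan."""
    fields = [field.strip() for field in line.strip().split(",")]
    elements: List[Union[str, TimeSpan]] = []
    i = 0
    while i < len(fields):
        if looks_like_timestamp(fields[i]):
            # Assumption: a start timestamp is always followed by its end.
            if i + 1 >= len(fields) or not looks_like_timestamp(fields[i + 1]):
                raise ValueError("start timestamp without an end timestamp")
            start = parse_timestamp(fields[i])
            end = parse_timestamp(fields[i + 1])
            if end < start:
                raise ValueError("end timestamp precedes start timestamp")
            elements.append(TimeSpan(start, end))
            i += 2
        else:
            # Anything else is treated as an ID or file name.
            elements.append(fields[i])
            i += 1
    return elements
```

Calling `parse_record("original_sound.wav,00:00:13,00:01:03.25,part_of_sound1.wav")` yields the source file name, a `TimeSpan` from 13.0 to 63.25 seconds, and the name of the extracted part, in that order. Keeping the elements in a flat list mirrors the flat data model described in the abstract; how a time span is interpreted relative to the element on its left is left to the caller.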