OLAC Metadata Metrics

Date issued: 2009-06-29
Status of document: Informational Note. This document provides background information related to an OLAC standard, recommendation, or service.
This version: http://www.language-archives.org/NOTE/metrics-20090629.html
Latest version: http://www.language-archives.org/NOTE/metrics.html
Previous version: http://www.language-archives.org/NOTE/metrics-20090218.html
Abstract:

Explains the metrics that are implemented on the OLAC web site for summarizing the coverage of the participating archives and for evaluating the quality of their metadata records.

Editors: Gary Simons, SIL International and Graduate Institute of Applied Linguistics
Changes since previous version:

The original implementation deducted one star from the overall rating for the existence of any error. The revised implementation weights the deduction by the square root of the error rate as explained below under Overall rating.

Copyright © 2009 Gary Simons (SIL International and Graduate Institute of Applied Linguistics). This material may be distributed and repurposed subject to the terms and conditions set forth in the Creative Commons Attribution-ShareAlike 2.5 License.

Table of contents

  1. Introduction
  2. The quality score
  3. Overall rating
  4. Integrity checks
  5. Other metrics
References

1. Introduction

The vision of OLAC is that "any user on the Internet should be able to go to a single gateway to find all the language resources available at all participating institutions" (see vision statement in [OLAC-Process]). The ability of a user to discover any relevant language resource is dependent on the quality of the metadata that describe it. Ensuring quality through peer review is a core value that OLAC employs to achieve its vision. "OLAC also conducts automated review based on peer consensus regarding best practice" (see core value statements in [OLAC-Process]).

Section 2 of this note explains the automated system that is implemented on the OLAC web site for evaluating the quality of metadata records. Section 3 explains the derivation of the overall quality rating as a score of 0 to 5 stars, based on the average metadata quality score less a deduction for known integrity problems if there are any. Section 4 explains the integrity checks that are performed. Finally, section 5 explains the other metrics that are reported in the [OLAC-Metrics] reports to support comparison of size and coverage of collections in addition to aspects of metadata quality and usage.

2. The quality score

The peer consensus regarding best practice is expressed in [OLAC-BPR] and further elucidated in [OLAC-Usage]. Many of the best practice recommendations for resource description cannot be automatically checked for conformance; however, there are many that can be. As an aid to creating descriptive metadata that meet the latter set of recommendations, OLAC has implemented an automated metadata quality score. Each metadata record receives a score in the range of 0 to 10 based on the presence or absence of recommended practices.

The practices in focus for the evaluation of metadata quality are ones that contribute to resource discovery. The score has two major parts: 50% is based on the metadata elements that are present and 50% is based on the use of encoding schemes. The elements provide the breadth and depth of the description, while the encoding schemes provide precision for interoperable searching.

The element part of the score consists of 4 points, one for each of four basic metadata elements that must be present to give the record minimal breadth of coverage, plus a further point awarded for additional elements that add to the depth of description. In the descriptions below, a non-empty metadata element is one that supplies a value, whether through element content or through the olac:code attribute. The element-based components of the score are awarded as follows:

Title

One point is awarded for the presence of a non-empty Title element. Absence of a title that is inherent to the resource does not block achieving this point, since in that case it is recommended best practice for the cataloger to supply a descriptive title enclosed in square brackets.

Date

One point is awarded for the presence of at least one non-empty Date element (or any of its refinements). Absence of a date in the resource itself does not block achieving this point, since in that case it is recommended best practice for the cataloger to supply an estimated date enclosed in square brackets.

Agent (Contributor, Creator, or Publisher)

One point is awarded for the presence of at least one non-empty element that provides an indication of who is behind the resource, whether as Contributor or Creator or Publisher.

About (Subject, Description, or Coverage)

One point is awarded for the presence of at least one non-empty element that provides an indication of what the resource is about, whether Subject or Description or Coverage (or any refinement of the latter two).

Depth

One-sixth point (up to a maximum of one point) is awarded for each element that is present in addition to the 8 that must be present in order to receive the 4 points above for basic elements and the 4 points that follow for basic encoding schemes. If the record has fewer than 8 elements, this part of the score is 0; otherwise, it is (total elements - 8) / 6 or 1, whichever is less. Note that in order to get the full score on this point, a record must contain at least 14 elements.
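The element part of the score can be sketched in Python as follows. This is only an illustration of the rules above, not the code that runs on the OLAC site. The Element class, the function names, and the sets of refinement names are assumptions introduced for the example, and the sketch assumes that "total elements" counts every element in the record.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical in-memory form of one metadata element (not an OLAC API)."""
    name: str                     # element name, e.g. "title" or "created"
    content: str = ""             # text content of the element
    code: Optional[str] = None    # value of the olac:code attribute, if any


# Refinement names from the DCMI terms vocabulary; which refinements the
# OLAC scorer actually accepts is an assumption of this sketch.
DATE = {"date", "available", "created", "dateAccepted", "dateCopyrighted",
        "dateSubmitted", "issued", "modified", "valid"}
AGENT = {"contributor", "creator", "publisher"}
ABOUT = {"subject", "description", "abstract", "tableOfContents",
         "coverage", "spatial", "temporal"}


def is_non_empty(element: Element) -> bool:
    # A value may come from element content or from olac:code.
    return bool(element.content.strip()) or bool(element.code)


def element_score(record: List[Element]) -> float:
    present = {e.name for e in record if is_non_empty(e)}
    # One point each for Title, Date, Agent, and About.
    basic = sum([
        "title" in present,
        bool(present & DATE),
        bool(present & AGENT),
        bool(present & ABOUT),
    ])
    # One-sixth point per element beyond the first 8, capped at 1 point.
    # Assumption: "total elements" counts every element in the record.
    depth = min(max(len(record) - 8, 0) / 6, 1.0)
    return basic + depth
```

Under these assumptions, a record holding only a title, a date, a creator, and a description earns 4 of the 5 element points, and the depth point is reached in full at 14 elements.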

The encoding scheme part of the score consists of 4 points, one for each of four basic element-plus-scheme pairs that must be present to support high recall and precision in searches for language resources. A further point is awarded for additional use of encoding schemes that add to the precision of resource description. The scheme-based components of the score are awarded as follows:

Content Language

One point is awarded for the presence of at least one Language element that uses the olac:language encoding scheme with a value from [OLAC-Language] in olac:code to precisely identify the language of content of the resource. Absence of any natural language content in a resource (such as in a software tool) does not block achieving this point, since in that case the recommended best practice is to use the ISO 639-3 code zxx meaning "No linguistic content."

Linguistic Type

One point is awarded for the presence of at least one Type element that uses the olac:linguistic-type encoding scheme with a value from [OLAC-Type] in olac:code to precisely identify the type of the resource from a linguistic point of view. Such a metadata element is relevant to the majority of OLAC records, but not to all. The remedy that has been identified is to extend the Linguistic Data Type vocabulary to a generally applicable Language Resource Type vocabulary that will be relevant to all OLAC records. Until the work is done to redefine the vocabulary, records for which Linguistic Data Type is not relevant will not earn this point.

Subject Language

One point is awarded for the presence of at least one Subject element that uses the olac:language encoding scheme with a value from [OLAC-Language] in olac:code to precisely identify the language that the resource is about. The notion of subject language is not relevant to every language resource. When the linguistic type of a resource is "primary_text" it is not required to have a subject language, and this point is awarded automatically. (Until the problem mentioned above under Linguistic Type is solved by creating a more general Language Resource Type vocabulary, the point will also be awarded automatically when there is no instance of olac:linguistic-type. This means that a resource other than a primary text for which subject language is truly not applicable will lose the point for Linguistic Type, but not be doubly penalized in the point for Subject Language.) When the linguistic type has any other value, there must be at least one Subject element using the olac:language encoding scheme in order to earn this point.

DCMI Type

One point is awarded for the presence of at least one Type element that uses the dcterms:DCMIType encoding scheme [DCMI-Type] to identify the generic type of the resource. The vocabulary is designed to be applicable to any resource and this is considered mandatory for OLAC metadata in order to support reliable searching for resources by type (such as audio recordings versus video recordings versus textual data versus software).

Precision

One-third point (up to a maximum of one point) is awarded for each additional encoding scheme that is used in the metadata record. Thus in order to earn full points, a record must use at least three encoding schemes in addition to olac:language, olac:linguistic-type, and dcterms:DCMIType.
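The scheme part of the score, including the automatic award of the Subject Language point, can be sketched as below. The (name, scheme, value) tuple form and the function name are hypothetical. The sketch also makes two assumptions that the text does not settle: a record that has "primary_text" among several linguistic types is treated as a primary text, and "additional encoding scheme" is read as a distinct xsi:type value other than the three basic schemes.

```python
from typing import List, Optional, Tuple

# Hypothetical element form for this sketch: (name, scheme, value), where
# scheme is the xsi:type value (or None) and value is the olac:code or content.
Element = Tuple[str, Optional[str], str]

BASIC_SCHEMES = {"olac:language", "olac:linguistic-type", "dcterms:DCMIType"}


def scheme_score(record: List[Element]) -> float:
    def uses(name: str, scheme: str) -> bool:
        return any(n == name and s == scheme and v for n, s, v in record)

    linguistic_types = {v for n, s, v in record
                        if n == "type" and s == "olac:linguistic-type" and v}

    content_language = uses("language", "olac:language")
    linguistic_type = bool(linguistic_types)
    # Awarded automatically for primary texts and when no linguistic type is
    # given. Treating a record with mixed types as a primary text is an
    # assumption of this sketch.
    subject_language = (uses("subject", "olac:language")
                        or not linguistic_types
                        or "primary_text" in linguistic_types)
    dcmi_type = uses("type", "dcterms:DCMIType")

    # Assumption: "additional encoding scheme" means a distinct xsi:type
    # value other than the three basic schemes.
    extra = {s for _, s, _ in record if s} - BASIC_SCHEMES
    precision = min(len(extra) / 3, 1.0)

    return sum([content_language, linguistic_type,
                subject_language, dcmi_type]) + precision
```

One consequence of the rules as stated is visible in the sketch: a record with no encoding schemes at all still receives the Subject Language point, because no olac:linguistic-type is present.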

The free-standing metadata service [OLAC-Free] can be used to see what quality score will be awarded to a given OLAC metadata record. The XML encoding of a record is pasted into a submission form. The service then validates the record, and if it is valid, a report of its quality score is generated with comments on what must be done to raise the score to 10. The same quality analysis is shown for a sample record from each participating archive by following the "Sample Record" link on the [OLAC-Archives] page.

The average quality score for all the records provided by a given participating archive can be seen by following the "Metrics" link on the [OLAC-Archives] page. The metrics report also shows the breakdown across the collection of all the components that go into the quality score.

3. Overall rating

The overall metadata quality is summarized graphically as a rating of 0 to 5 stars. The overall rating is computed as a base rating minus a penalty if the metadata are known to contain integrity violations.

The base rating is computed from the average quality score (see section 2) for all records in the repository. The average quality score (which ranges from 0 to 10) is converted to the base rating by dividing by 2 and rounding to the nearest whole number, with halves rounded up. Thus, an average quality score of 9 or higher converts to 5 stars, scores from 7 up to but not including 9 convert to 4 stars, and so on.

In the absence of known integrity problems, the overall rating is reported as the base rating. If integrity problems have been detected, then a penalty is deducted from the unrounded base rating (the average quality score divided by 2) before rounding to the nearest whole number. The deduction is the square root of the number of integrity errors per record. Thus, if the repository averages one error per record, the deduction is one star; if 4 errors per record, then 2 stars; if 9 errors per record, then 3 stars, and so on. Conversely, if there is one error every 4 records, then the deduction is 0.5 stars; if one per 9 records, then 0.33 stars, and so on. The formula for the overall rating as a whole number of stars is therefore:

rating = round( (average_quality / 2) - (errors / records)^0.5 )
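The formula can be checked with a few lines of Python. This is a sketch rather than the site's implementation. The function name is hypothetical, and the clamp at zero is an assumption based on the stated range of 0 to 5 stars. Rounding is written as floor(x + 0.5) because the text maps an average score of exactly 9 to 5 stars, whereas Python's built-in round() rounds 4.5 to the nearest even number, 4.

```python
import math


def overall_rating(average_quality: float, errors: int, records: int) -> int:
    """Whole-star rating: half the average score minus sqrt(errors per record)."""
    base = average_quality / 2
    penalty = math.sqrt(errors / records) if records else 0.0
    # Round half up, then clamp at zero (the clamp is an assumption).
    return max(0, math.floor(base - penalty + 0.5))


print(overall_rating(9, 0, 100))     # no errors: 4.5 rounds up to 5 stars
print(overall_rating(10, 100, 100))  # one error per record: 5 - 1 = 4 stars
print(overall_rating(10, 400, 100))  # four errors per record: 5 - 2 = 3 stars
```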

The number of errors for the repository with identifier archive-id is found in its metrics report at:

http://www.language-archives.org/metrics/archive-id

If the reported number of "Known Integrity Problems" is greater than 0, then a deduction has been made in assessing the overall rating. When the number is 0, but formatted as a link to the integrity report, no deduction has been made but there are warnings of potential problems. The next section describes the integrity violations and warnings that are detected automatically.

4. Integrity checks

The report of known and potential integrity problems for the archive with identifier archive-id is found at:

http://www.language-archives.org/checks/archive-id

The report has two sections: Errors (which cause a deduction to the overall rating) and Warnings (which do not). The report has three columns: the error or warning message, the offending value, and the id of the record the problem occurs in (as a link to the online display of the record). The top of the report has a link which allows the information to be downloaded in TSV (tab-separated value) format. In the download table, the message is reported as a three-letter code and an extra column with value E or W is added to indicate the severity as "Error" or "Warning".
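The downloaded table can be tallied with Python's csv module, as in the sketch below. The note does not state the column order of the TSV file or whether it has a header row. The order used here (code, offending value, record id, severity) and the absence of a header are assumptions that should be checked against a real download. The function name is hypothetical.

```python
import csv
import io
from collections import Counter

# Assumed column order; the note does not specify it.
COLUMNS = ("code", "value", "record_id", "severity")


def summarize_checks(tsv_text: str):
    """Count errors (E) and warnings (W) per three-letter message code."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t",
                        quoting=csv.QUOTE_NONE)
    # Rows with an unexpected number of columns are skipped.
    rows = [dict(zip(COLUMNS, row)) for row in reader
            if len(row) == len(COLUMNS)]
    errors = Counter(r["code"] for r in rows if r["severity"] == "E")
    warnings = Counter(r["code"] for r in rows if r["severity"] == "W")
    return errors, warnings
```

Only the total of the error counts matters for the deduction described under Overall rating; the warning counts are informational.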

These are the possible error messages:

BCC  

Bad Country Code — The value supplied for dcterms:ISO3166 is not defined in the ISO 3166 code set.

BCR  

Bad Contributor Role — The value supplied for olac:role is not defined in the vocabulary.

BDI  

Bad Discourse Type — The value supplied for olac:discourse-type is not defined in the vocabulary.

BDT  

Bad DCMI Type — The value supplied for dcterms:DCMIType is not defined in the vocabulary.

BLC  

Bad Language Code — The value supplied for olac:language is not defined in the ISO 639 code set.

BLF  

Bad Linguistic Field — The value supplied for olac:linguistic-field is not defined in the vocabulary.

BLT  

Bad Linguistic Type — The value supplied for olac:linguistic-type is not defined in the vocabulary.

BSI  

Bad Sample Identifier — The <sampleIdentifier> specified in the Identify response is not present in the repository.

NSI  

No Such Item — The combined OLAC catalog does not contain an entry with the given OAI identifier.

RNC  

Repository Not Current — The currentAsOf date is more than 12 months old.

RNF  

Resource Not Found — An attempt to follow the link yields a 404 (Resource not found) error.

These are the possible warning messages:

BRU  

Broken Repository URL — Accessing the URL for a static repository or the base URL for a dynamic repository generates a 404 error.

HFC  

Harvesting Fails to Complete — Some records are being harvested, but an integrity issue in the data or a bug in the repository software is causing premature termination.

MLC  

Missing Language Code — The element uses olac:language extension but no olac:code is given.

PLC  

Private Use Language Code — The value supplied for olac:language is a private use code in the range qaa to qtz. It should be changed to a standard code as soon as one becomes available; submit a change request if necessary [ISO639-3-Changes].

RLC  

Retired Language Code — The supplied value is a recognized code from ISO 639, but it is not best practice since it is a retired code. Consult the ISO 639 documentation for the code to learn what codes have replaced it.

RNA  

Resource Not Available — An attempt to follow the link failed for a reason other than a 404 error.

RNV  

Repository Not Valid — The retrieved static repository file is not valid.

SIL  

Should be Individual Language — The value supplied for olac:language is a recognized code from ISO 639, but it is not best practice since it represents a collection of languages.
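Two of the warnings above can be detected without consulting the ISO 639 code tables: MLC (no code given) and PLC (a code in the private use range qaa to qtz). The following sketch covers only those two cases. The function names are hypothetical, and the retired-code and collective-code warnings (RLC, SIL) are omitted because they require the code tables.

```python
from typing import Optional


def is_private_use(code: str) -> bool:
    """True for a three-letter lowercase code in the range qaa to qtz."""
    return (len(code) == 3 and code.isascii() and code.isalpha()
            and code.islower() and "qaa" <= code <= "qtz")


def language_code_warning(code: Optional[str]) -> Optional[str]:
    """Return "MLC" or "PLC" for an olac:language element, or None."""
    if not code:
        return "MLC"   # olac:language is used but no olac:code is given
    if is_private_use(code):
        return "PLC"   # private use code
    return None        # other checks need the ISO 639 code tables
```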

5. Other metrics

The [OLAC-Metrics] page reports a set of metrics that summarize the size and coverage of each participating archive as well as the quality of its metadata records. The "OLAC Archive Metrics" tab reports the metrics for the participating archive that has been selected from the drop-down list. The "Comparative Archive Metrics" tab shows the summary statistics for all participating archives in a single table. When first opened, the rows of the table are in alphabetical order of the archive names. The rows can be reordered to reflect their rank with respect to a particular metric by clicking in the column header for that metric. Clicking again reverses the order.

When "ALL ARCHIVES" is selected, the Summary Statistics table begins with the following three metrics that apply only to the OLAC catalog as a whole; when an individual archive is selected, these metrics are absent.

Number of Archives

The total number of metadata repositories that are currently being harvested by the OLAC aggregator. A complete enumeration of the participating archives is given on the [OLAC-Archives] page.

Archives with Fresh Metadata

The number (and percentage) of participating archives that have updated their metadata repositories within the past twelve months. A repository is counted as having been updated within the past twelve months if either the currentAsOf date in the OLAC archive description (see section 3 of [OLAC-Repositories]) is within the past twelve months or the most recent datestamp for an individual metadata record is within the past twelve months.

Archives with Five-star Metadata

The number (and percentage) of participating archives that receive the top score for overall metadata quality (see Overall rating).
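The freshness test described under Archives with Fresh Metadata can be sketched as follows. The function names are hypothetical. Treating a date exactly twelve months old as still fresh, and mapping 29 February to 28 February of the previous year, are assumptions that the text does not address.

```python
from datetime import date
from typing import Optional


def twelve_months_before(today: date) -> date:
    try:
        return today.replace(year=today.year - 1)
    except ValueError:
        # 29 February has no counterpart in the previous year.
        return today.replace(year=today.year - 1, day=28)


def is_fresh(current_as_of: Optional[date],
             latest_datestamp: Optional[date],
             today: date) -> bool:
    """Fresh if either date falls within the past twelve months."""
    cutoff = twelve_months_before(today)
    return any(d is not None and d >= cutoff
               for d in (current_as_of, latest_datestamp))
```

Either date alone is enough, so an archive whose currentAsOf date is stale still counts as fresh when at least one record carries a recent datestamp.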

The following metrics summarize the size and coverage of the selected archive (or of all archives when that is selected):

Number of Resources

The total number of metadata records in the repository of the selected archive.

Number of Resources Online

The number of records from the selected archive describing resources that are accessible online; that is, they have an Identifier element whose value is a URL beginning with http:, https:, or ftp:.

Distinct Languages

The number of distinct languages that are covered within the selected archive's collection; that is, the number of distinct code values that are used from the olac:language encoding scheme [OLAC-Language], whether with the Language element or the Subject element.

Distinct Linguistic Subfields

The number of distinct linguistic subfields that occur as subject classifications within the selected archive's collection; that is, the number of distinct code values that are used from the olac:linguistic-field encoding scheme [OLAC-Field].

Distinct Linguistic Types

The number of distinct linguistic data types (e.g. primary_text versus lexicon versus language_description) that occur within the selected archive's collection; that is, the number of distinct code values that are used from the olac:linguistic-type encoding scheme [OLAC-Type].

Distinct DCMI Types

The number of distinct DCMI resource types (e.g. Text, Sound, MovingImage, Software, and so on) that occur within the selected archive's collection; that is, the number of distinct values that are used from the dcterms:DCMIType encoding scheme [DCMI-Type].
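Three of the size and coverage metrics above are illustrated in the sketch below. Each record is a list of hypothetical (name, scheme, value) tuples, and the function names are likewise introduced only for this example. Matching the URL prefix without regard to case is an assumption.

```python
ONLINE_PREFIXES = ("http:", "https:", "ftp:")


def is_online(record) -> bool:
    """True if any Identifier value is a URL with one of the listed prefixes."""
    return any(name == "identifier"
               and value.strip().lower().startswith(ONLINE_PREFIXES)
               for name, scheme, value in record)


def coverage_metrics(records) -> dict:
    # Distinct olac:language codes, whether on Language or Subject elements.
    languages = {value
                 for record in records
                 for name, scheme, value in record
                 if scheme == "olac:language"
                 and name in ("language", "subject") and value}
    return {
        "resources": len(records),
        "online": sum(is_online(r) for r in records),
        "distinct_languages": len(languages),
    }
```

The counts for linguistic subfields, linguistic types, and DCMI types follow the same pattern with a different scheme name.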

The following metrics summarize aspects of metadata quality for the selected archive (or for all archives when that is selected):

Average Elements Per Record

The average number of elements (including refinements from the dcterms namespace) per metadata record.

Average Encoding Schemes Per Record

The average number of elements per metadata record that use the xsi:type attribute to specify an encoding scheme for expressing the value.

Average Metadata Quality Score

The average of the quality score for all the metadata records in the selected archive (see The quality score); the maximum value is 10.

Date of Latest Update

The date on which the archive last updated its metadata repository. It is computed as the most recent of the <datestamp> values that occur in the headers of the metadata records as returned by the OAI-PMH protocol.

The OLAC Archive Metrics page continues with a Metadata Usage summary consisting of four histograms:

Core Components

This histogram reports the use of core metadata components as recommended by [OLAC-BPR]. The eight lines correspond to the eight components of the quality score (see section 2) that are awarded as full points for the presence or absence of a recommended element or encoding scheme. The length of a bar represents the percentage of metadata records that contain that metadata component.

Element Usage

This histogram lists all of the metadata elements in the Dublin Core scheme. The length of a bar represents the total number of times a given element has been used within the records of the selected archive. It is the count of element uses (not records that use the element); thus the counts exceed the total number of resources in the archive for elements that occur multiple times per record.

Refinement Usage

This histogram lists all of the defined refinements to metadata elements in the Dublin Core scheme. The length of a bar represents the total number of times a given refinement has been used within the records of the selected archive. It is the count of refinement uses (not records that use the refinement); thus the counts exceed the total number of resources in the archive for refinements that occur multiple times per record.

Encoding Scheme Usage

This histogram lists all of the encoding schemes that may occur as the value of the xsi:type attribute. The length of a bar represents the total number of times a given encoding scheme has been used within the records of the selected archive. It is the count of encoding scheme uses (not records that use the encoding scheme); thus the counts exceed the total number of resources in the archive for encoding schemes that occur multiple times per record.
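The difference between counting uses and counting records, which applies to the last three histograms, can be seen in a short sketch. Each record is reduced here to a plain list of element names. That form and the function name are assumptions made for the example.

```python
from collections import Counter


def element_usage(records):
    """Return (uses per element, records per element) for a list of records."""
    # Every occurrence counts, so repeated elements add to the total.
    uses = Counter(name for record in records for name in record)
    # Each record counts at most once per element.
    records_using = Counter(name for record in records
                            for name in set(record))
    return uses, records_using


uses, records_using = element_usage([["title", "subject", "subject"],
                                     ["title"]])
print(uses["subject"], records_using["subject"])
```

The histograms report the first of the two counts, which is why a bar can exceed the number of resources in the archive.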


References

[DCMI-Type] DCMI Type Vocabulary.
<http://dublincore.org/documents/dcmi-type-vocabulary/>
[ISO639-3-Changes] Submitting ISO 639-3 Change Requests.
<http://www.sil.org/iso639-3/submit_changes.asp>
[OLAC-Archives] OLAC: Participating Archives.
<http://www.language-archives.org/archives.php4>
[OLAC-BPR] Best Practice Recommendations for Language Resource Description.
<http://www.language-archives.org/REC/bpr.html>
[OLAC-Field] OLAC Linguistic Subject Vocabulary.
<http://www.language-archives.org/REC/field.html>
[OLAC-Free] Free-standing OLAC Metadata.
<http://www.language-archives.org/tools/metadata/freestanding.html>
[OLAC-Language] OLAC Language Extension.
<http://www.language-archives.org/REC/language.html>
[OLAC-Metrics] OLAC Archive Metrics and Comparative Archive Metrics.
<http://www.language-archives.org/metrics/>
[OLAC-Process] OLAC Process, Section 2, "Governing ideas".
<http://www.language-archives.org/OLAC/process.html#Governing%20ideas>
[OLAC-Repositories] OLAC Repositories.
<http://www.language-archives.org/OLAC/repositories.html>
[OLAC-Type] OLAC Linguistic Data Type Vocabulary.
<http://www.language-archives.org/REC/type.html>
[OLAC-Usage] OLAC Metadata Usage Guidelines.
<http://www.language-archives.org/NOTE/usage.html>