OLAC Implementers' FAQ
The following questions are answered elsewhere in the general OLAC FAQ:
In the prototypical case, a participating institution is an archive that curates language resources. In this case, the OLAC metadata repository it implements is a catalog of its archival holdings. (Participants that operate such archives can be identified by the presence of the <archivalSubmissionPolicy> element in the <olac-archive> description that is part of the Identify response of their repository.) But operating an archive is not a requirement. The mission of OLAC is to create a "worldwide virtual library of language resources" and anyone who can contribute useful information to the catalog of known language resources is invited to do so. This may include contributions from individuals rather than institutions. It may also include metadata repositories that are indexes to language resources supplied by others or that describe entry points into an online database of language-related information.
Each record in an OLAC metadata repository represents a language resource that the participating archive wants to inform the world about. The resource could be a physical object like a book, a CD-ROM, a wax cylinder or a box of unpublished field notes. In such cases, the metadata record allows potential users to discover that the resource exists. When the resource is a digital object that is posted on the web (such as a document, a corpus of recordings, a database or a software program), then the metadata record can go a step further to provide access to the resource by supplying the URL in the <dc:identifier> element. The record need not describe a resource that the contributor actually has under archival control. For instance, a record could be an annotated bibliography entry in which the contributor has added value to the basic citation by supplying a description and subject classification. Or a record could describe a web page that is an index page giving links to language resources that are posted on other web sites.
An OLAC metadata repository is essentially a set of language resource descriptions. A good place to start is to look at the OLAC Metadata Usage Guidelines to learn how OLAC describes language resources. That document defines and illustrates each of the possible elements of a language resource description. It will be helpful to look, at the same time, at examples of complete metadata records from existing OLAC participants. A key decision to make before you begin implementing is what to treat as an individual item. As you look at examples from other OLAC participants, make note of records that match your situation and that you could use as models for resource description in your repository. However, rather than simply using an existing record as a template for your own records, you should evaluate it against OLAC's best practice recommendations for language resource description. In the process you may discover ways to improve the quality of the metadata records you will produce.
Once you have a good idea of what the records in your repository will be like, the next step is to read carefully the OLAC Repositories standard. You will see that it defines two approaches, implementing a static repository versus a dynamic repository. You need to decide which kind of repository to implement, then focus on the corresponding section of the OLAC standard. If you are implementing a static repository, you will want to look at the more complete Specification for an OAI Static Repository which is the standard from the digital library community that the OLAC standard is based on. You may also find it helpful at this stage to look at complete examples of static repositories. When you are ready to begin implementing, you may follow the instructions for creating a repository by hand or for using a script to generate a repository from a database.
Similarly, if you are implementing a dynamic repository, you will need to consult the complete specification of the Open Archives Initiative Protocol for Metadata Harvesting and may want to see complete examples of dynamic repositories. When you are ready to begin implementing, you will want to start by searching for sample code that you can use as a basis for your implementation.
The OLAC Repositories standard addresses this issue in a section on Guidelines concerning relevance and granularity. The basic guideline is this: "A metadata repository should treat resources with a single provenance as constituting a single unit with respect to OLAC metadata and should, therefore, describe them within a single record." For published resources, the publication unit typically constitutes the appropriate unit for the OLAC metadata record. Unpublished papers presenting research findings closely parallel typical published works and can be treated at a comparable level in an OLAC metadata record. For primary source materials (e.g., recordings, transcriptions, annotations, notes, data sets), the typical practice of archivists is to gather such materials into collections based on shared provenance, that is, based on having a common origin and history. These collections are then the primary units for metadata description, resulting in OLAC records of DCMI Type Collection. See the section on Granularity of resources in the OLAC Metadata Usage Guidelines for a more in-depth discussion of the principle of provenance as applied to collections and metadata within the OLAC context.
There are many valuable language resources that are underused because they are part of the Deep Web (that is, the portion of the Internet that is obscured from discovery by general search engines because the resources are in a database that is accessible only via a search interface on its host site). Such resources can be brought to the indexable web by creating OLAC records for them, since it is a built-in service of OLAC to convert every record into a page that gets crawled by web spiders. (To see the set of pages for a given archive, click the Archives link on the OLAC home page, click the "More Details" link for the desired archive, and then click the "Records in Archive" link on the resulting Archive Details page.)
The only requirement on a web database for using an OLAC repository to expose and access its language resources is that it have publicly accessible URLs for the resources. These often involve a base URL that uses parameters to provide arguments to a query. For instance, the resources in the LINGUIST List Language Resources repository and the ODIN: Online Database of Interlinear Text repository are dynamic pages that generate a listing of everything held in the database about a particular language; the ISO 639-3 code for the language is a parameter to the URL that is given in the <dc:identifier> element of each record. (Click the "Records in Archive" link on the referenced Archive Details pages to see sample records.)
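As a rough illustration of the pattern described above, the following Python sketch builds a public URL in which the ISO 639-3 code is a query parameter; the result is the kind of value that could go in a record's <dc:identifier> element. The base URL and the parameter name `code` are hypothetical, not those of any actual repository.

```python
from urllib.parse import urlencode

# Hypothetical base URL of a database page that lists holdings per language.
BASE_URL = "http://example.org/resources/by-language"


def identifier_url(iso639_3_code: str) -> str:
    """Return a public URL whose query string carries the language code."""
    code = iso639_3_code.strip().lower()
    # ISO 639-3 codes are three ASCII letters; reject anything else early.
    if len(code) != 3 or not (code.isascii() and code.isalpha()):
        raise ValueError(f"not a three-letter ISO 639-3 code: {iso639_3_code!r}")
    # "code" is an assumed parameter name; use whatever your database expects.
    return f"{BASE_URL}?{urlencode({'code': code})}"


print(identifier_url("swh"))
```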
To see the metadata records from a given archive, click the Archives link on the OLAC home page, click the "More Details" link for the desired archive, and then click the "Records in Archive" link on the resulting Archive Details page. Alternatively, you may get to the same place by going to the List Records from an OLAC Archive page, selecting an archive from the dropdown list, and then clicking the Submit Query button. When you click on the identifier for a particular item, you will see an HTML representation of the metadata. To see the underlying XML format of the OLAC metadata record, click on the "OAI-PMH request for OLAC format" link.
The following are good examples of records for different kinds of resources:
The Metadata Usage Survey tool may be used to see examples of how individual metadata elements have been used throughout the OLAC catalog. Note, however, that many uses do not conform to best practice recommendations. A good way to use the tool is to identify ways of using elements that do conform to best practice and then click on the link for the number of occurrences to see a list of the records in which it has been used in this way. It is often helpful to see these elements in context to get ideas of how best to use them in complete metadata records.
There are two basic approaches as laid out in the OLAC Repositories standard. The first is to build a static repository. In a static repository, the entire catalog for the repository is expressed in a single XML file that contains all of the metadata records. This is the simpler of the two approaches and can be used when the repository is relatively small—up to approximately 5,000 records. When there is no existing catalog database, the implementer can use an XML editor to create and maintain a repository by hand. When there is an existing catalog database, the implementer can use a script to export it to the proper XML format.
The second approach is to implement a dynamic repository. This approach can be used for a catalog of any size, but when the catalog is larger than 5,000 records, it becomes necessary to use this approach. In a dynamic repository, the implementer writes program code that resides at a base URL on a web site and responds to the requests of the OAI Protocol for Metadata Harvesting in order to provide dynamic access to information in an existing catalog database. The trickiest part about implementing a dynamic repository is that it is necessary to implement flow control using a resumption token mechanism in order to ensure that the responses to individual protocol requests are not exceedingly large. The OAI community considers half a megabyte to be a reasonable response size (which corresponds to about 500 records in a typical repository).
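The flow control idea can be sketched in a few lines of Python. This is only a minimal illustration, assuming the whole catalog is available as an in-memory list and that the resumption token is simply the offset of the next record; a real repository would query its database and would probably use an opaque token. The function names and the page size of 500 are choices made for this sketch.

```python
from typing import List, Optional, Tuple

PAGE_SIZE = 500  # about half a megabyte in a typical repository, per the text


def list_records_page(
    records: List[str],
    resumption_token: Optional[str] = None,
    page_size: int = PAGE_SIZE,
) -> Tuple[List[str], Optional[str]]:
    """Return one page of records plus the token for the next page, if any."""
    offset = 0
    if resumption_token is not None:
        offset = int(resumption_token)  # raises ValueError on a malformed token
        if not 0 < offset < len(records):
            raise ValueError("bad resumption token")
    page = records[offset:offset + page_size]
    next_offset = offset + len(page)
    # No token on the final page tells the harvester that the list is complete.
    next_token = str(next_offset) if next_offset < len(records) else None
    return page, next_token


def harvest_all(records: List[str], page_size: int = PAGE_SIZE) -> List[str]:
    """Play the harvester's role: keep requesting until no token comes back."""
    harvested: List[str] = []
    token: Optional[str] = None
    while True:
        page, token = list_records_page(records, token, page_size)
        harvested.extend(page)
        if token is None:
            return harvested
```

With a catalog of 1,200 records, the harvester in this sketch would make three requests, receiving 500, 500 and 200 records.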
The last thing in the section on static repositories in the OLAC Repositories standard is a link to a complete example. Clicking this link retrieves an XML document which is a hypothetical sample repository containing two records.
It is also possible to see the XML document behind each of the static repositories that have been implemented by participating archives. Approximately nine-tenths of the repositories are implemented in this way. The XML document for a static repository can be retrieved in whole by inspecting its base URL. That URL is found by clicking the Archives link on the OLAC home page, then clicking the "More Details" link for the desired archive. Find the line that says "Base URL", copy the URL, and paste it into a web browser. If the repository is dynamic, the result will be a documentation page or an error response concerning an illegal verb. However, in nine-tenths of the cases, the repository is static and the result will be the XML document that holds all the contents of the repository.
For a repository that has only a few records, it is very straightforward to create it from scratch using an XML editor. It also makes sense to build and maintain a larger repository by hand if there is no existing database for the metadata. In essence, an XML editor can be used to create and maintain a static repository as the database. For instructions on how to do this, see the OLAC Informational Note on Using <oXygen/> XML Editor to Create or Validate an OLAC Static Repository. <oXygen/> is a mature and full-featured product that offers a 30-day free trial and attractive academic pricing ($48 at the time of writing).
Follow this link to find sample code for generating a static repository. It is written in PHP and extracts the metadata from a MySQL database. Another package (written in C) converts a BibTeX file to an OLAC Static Repository.
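For readers who prefer Python to the PHP and C packages mentioned above, here is a small sketch of the central step: turning one database row into an <olac:olac> record. The row's field names are hypothetical, and the namespace URIs and the olac:language encoding reflect our understanding of the OLAC 1.1 metadata format; check them against the OLAC Metadata standard before relying on them. The sketch does not produce the enclosing static repository document.

```python
import xml.etree.ElementTree as ET

# Namespace URIs as we understand them for OLAC 1.1 and Dublin Core;
# verify against the OLAC Metadata standard.
OLAC = "http://www.language-archives.org/OLAC/1.1/"
DC = "http://purl.org/dc/elements/1.1/"
XSI = "http://www.w3.org/2001/XMLSchema-instance"
for prefix, uri in (("olac", OLAC), ("dc", DC), ("xsi", XSI)):
    ET.register_namespace(prefix, uri)


def row_to_record(row: dict) -> ET.Element:
    """Build an <olac:olac> element from one row (field names are made up)."""
    record = ET.Element(f"{{{OLAC}}}olac")
    ET.SubElement(record, f"{{{DC}}}title").text = row["title"]
    ET.SubElement(record, f"{{{DC}}}identifier").text = row["url"]
    # Subject language, identified by its ISO 639-3 code.
    subject = ET.SubElement(record, f"{{{DC}}}subject")
    subject.set(f"{{{XSI}}}type", "olac:language")
    subject.set(f"{{{OLAC}}}code", row["language_code"])
    return record


row = {
    "title": "Sample word list",
    "url": "http://example.org/resources/0001",
    "language_code": "swh",
}
print(ET.tostring(row_to_record(row), encoding="unicode"))
```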
Note that it will not work to supply the URL of your script as the base URL for a static repository. While accessing such a URL might return a valid static repository document that is up-to-the-minute with respect to the underlying database, this will not work correctly with the harvester because the harvesting protocol assumes that a static repository is a static data file. The static repository harvester uses the HTTP protocol to issue a HEAD request to determine if the file has changed. It reharvests the repository only if the "Last-Modified" time in the response is more recent than the time of the last successful harvest.
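The harvester's decision can be pictured with the following Python sketch. It is not the harvester's actual code; it only shows the comparison described above, assuming the "Last-Modified" value has already been read from the HEAD response and the time of the last successful harvest is stored as a timezone-aware datetime. The function name is made up for this illustration.

```python
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def needs_reharvest(last_modified_header: str, last_harvest: datetime) -> bool:
    """Return True if the file changed after the last successful harvest."""
    modified = parsedate_to_datetime(last_modified_header)
    if modified.tzinfo is None:
        # Treat a header without zone information as UTC (an assumption).
        modified = modified.replace(tzinfo=timezone.utc)
    return modified > last_harvest


last_harvest = datetime(2011, 8, 12, tzinfo=timezone.utc)
print(needs_reharvest("Sat, 13 Aug 2011 10:00:00 GMT", last_harvest))
print(needs_reharvest("Thu, 11 Aug 2011 10:00:00 GMT", last_harvest))
```

A script that regenerates the document on every request usually cannot give a meaningful answer to this check, which is one way to see why a script URL does not behave like a static file.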
The OLAC Archive Registration page provides a service that makes a final test of conformance to the OLAC Repositories standard before a new repository is admitted to the registration process. This same service can be used at any time to test the conformance of a repository file, even if you know it is not ready for registration or even if a previous version is already registered. There are two cases:
The repository file is posted on a public web site. If your static repository file has been placed on a publicly accessible web site, then go to the OLAC Archive Registration page and paste the URL for your repository file into the first text box on the page. Click the "VALIDATE" button to test the file.
The repository file is only on your work station. If your static repository file is on your local work station, then go to the OLAC Archive Registration page and click the "Browse" button to the right of the second text box on the page. After you have located the repository file, click the "UPLOAD & VALIDATE" button to test the file.
If the final line of the report says "SUCCESS," then you are ready to proceed to the registration step. On the other hand, if the final line says "FAILURE," you still have work to do. Make note of all the individual tests that failed and fix the offending content in the repository; then repeat the validation process. In the case of failing to pass an XML Schema Validation test, click on the "error logs" link to see the list of errors generated by the validator. The two validators report the errors differently, so look at both logs to get a fuller idea of what is wrong. Before the job is finished, you will also want to test some of your records for metadata quality; see below on how to improve the quality of the metadata in your repository.
There are two options. The first option is to post the XML file for your static repository on a public web site, such as the site run by your repository's host institution. Ideally you will want it where you have write access, since you will want to be able to update the repository from time to time. Once your repository is posted for public access, you are ready to register with OLAC.
If you do not have access to a public web site, the second option is to let OLAC host your static repository. If you choose this option, the uploading is done as part of the registration process so you are ready to proceed to registering with OLAC.
Only about one-tenth of the OLAC repositories are implemented in this way. Two of them, Ethnologue and SIL Language and Culture Archives, implement a documentation page that provides a large set of links for testing all the verbs of the harvesting protocol with various combinations of parameters, including combinations that should generate error responses. You may experience a sample implementation and get a better feel for how the protocol works by clicking on one of the above links and trying the various links on the documentation page. The text of each link gives the complete parameter string that is appended to the Base URL to form a full URL. Clicking the link will retrieve the XML document which the dynamic repository returns as the response to the request. The full request URL will show in the location box of your browser. Click the browser's Back button to go back to the documentation page and try another protocol request.
The tools page provided by the Open Archives Initiative gives links to implementations of OAI data providers in Java, Perl, PHP, and Microsoft ASP. These would need to be enriched to support the additional requirements on dynamic repositories in the OLAC Repositories standard.
A best practice that has been suggested within the OAI community is to implement Document as an additional verb that provides a human-readable documentation page. The base URL of your dynamic repository (with no request parameters) can then be redirected to ?verb=Document. OLACA (the OLAC Aggregator) is implemented in this way. It is recommended that you begin your development by making a copy of the OLACA documentation page and changing it to match your situation. (Be sure to set the <BASE> tag with your base URL up to the script name so that the HREFs can simply begin with the script name.) The page contains a sample link to test each significant case of each protocol verb, including all the error conditions. As you are developing and debugging your repository, a first level of testing can be done by clicking these links to manually test the calls and examine the resulting XML responses.
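The test links on such a documentation page are ordinary protocol requests. The Python sketch below generates a few of them, assuming a hypothetical base URL and record identifier; the verb names and the `metadataPrefix` and `identifier` arguments are those of the OAI-PMH protocol, and the last two cases are deliberately invalid so that the error handling can be checked.

```python
from urllib.parse import urlencode

# Hypothetical base URL, up to and including the script name.
BASE_URL = "http://example.org/olac/oai.php"

TEST_CASES = [
    {"verb": "Identify"},
    {"verb": "ListMetadataFormats"},
    {"verb": "ListIdentifiers", "metadataPrefix": "olac"},
    {"verb": "ListRecords", "metadataPrefix": "olac"},
    # The identifier below is made up for illustration.
    {"verb": "GetRecord", "metadataPrefix": "olac",
     "identifier": "oai:example.org:0001"},
    {"verb": "NoSuchVerb"},   # should yield an illegal-verb error
    {"verb": "ListRecords"},  # missing metadataPrefix, should yield an error
]


def request_url(params: dict) -> str:
    """Append an encoded parameter string to the base URL."""
    return f"{BASE_URL}?{urlencode(params)}"


for case in TEST_CASES:
    print(request_url(case))
```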
When you are satisfied with the results you are getting, the next step is to try the more formal tests at the Open Archives Initiative - Repository Explorer. After entering the base URL of your repository, this tool will put it through a battery of tests as well as provide a web interface that will allow you to query and explore your repository. When your repository passes all the tests of the OAI protocol, then you are ready to proceed to the OLAC validation and registration process which is described under the next question.
A new repository is registered by means of the OLAC Archive Registration page. There are two cases:
The repository file is posted on a public web site. If you have a dynamic repository or a static repository that has been placed on a publicly accessible web site, then go to the OLAC Archive Registration page and paste the URL for your repository into the first text box on the page. Click the "VALIDATE" button to test the file.
The repository file is only on your work station. If your static repository file is on your local work station, then go to the OLAC Archive Registration page and click the "Browse" button to the right of the second text box on the page. After you have located the repository file, click the "UPLOAD & VALIDATE" button to test the file.
If the final line of the resulting validation report says "FAILURE," then you still have work to do. Make note of all the individual tests that failed and fix the offending content in the repository; then repeat the validation process. If the XML Schema Validation is failing, click on the "error logs" link to see the list of errors generated by the XML parsers. The two parsers report the errors differently, so look at both logs to get a fuller idea of what is wrong.
When the final line of the validation report says "SUCCESS," then you will be presented with a button that allows you to submit a registration request. After clicking the button, you should get an email confirmation of the submission. The OLAC Coordinators also receive notification of the pending registration request. They review the repository to ensure that it meets the criteria enumerated in the OLAC Process standard, namely, that it must catalog language resources and must conform to the standards of OLAC. Following this review you will receive an email notification describing the outcome, either confirmation that the repository has been accepted and harvested, or an explanation concerning what is not yet in order.
OLAC has defined best practice recommendations for language resource description. You should review a few records from your repository against this list of recommendations to see if there are things you could change in the implementation of your repository (or of the underlying database on which it is based) to follow even more of the recommendations. It is not possible to automate tests for compliance to all the recommendations, so a complete review must be done by hand. If you would like help doing such an audit of metadata quality, contact the OLAC Coordinators; the link in the footer of this page is a mailto: link to their addresses.
However, many aspects of metadata quality can be automatically tested. OLAC has implemented metadata metrics to give implementers feedback on how well they have followed those recommendations. The Freestanding Metadata Service allows you to learn the metadata quality score for any record. Simply paste the complete <olac:olac> record into the text box on the page and click the "Analyze" button. If the format of the record is not valid, the resulting page will report the errors. Otherwise, the resulting page shows an analysis of the record. The bottom part of the page gives the metadata quality score along with recommendations on changes that would improve the score. It is possible to click the browser's Back button to return to the record entry form, edit the record to make changes that you think will improve the quality score, and then click "Analyze" again to test the result.
If the repository is hosted on your own site (whether it is static or dynamic), simply update it on your site. There is no need to inform OLAC. As long as the repository is at the same base URL, it will be harvested on the regular schedule (explained in the next answer) and your updates will be automatically uploaded into the database at that time.
If it is a static repository that is hosted on the OLAC site, then you must use the registration service to upload the new version. Go to the OLAC Archive Registration page and click the "Browse" button to the right of the second text box on the page. After you have located the revised repository file on your hard disk, click the "UPLOAD & VALIDATE" button to upload the file. This will also test the file for conformance to the OLAC Repositories standard. If there are problems, you will be notified and instructed to correct the problems and repeat the process. If the validation is successful, you will be offered an option to replace the existing repository with the new one. When you select that option, a confirmation request will be sent to the adminEmail declared in the original repository. Once the request is confirmed, the repository will be replaced with the new file.
Incremental harvesting takes place every day. Before issuing the ListRecords command of the OAI-PMH protocol, the harvester looks up the date of the last successful harvest of the repository and adds that date as the value of the from parameter. All records with an oai:datestamp in their header that is equal to or later than that date will be returned by the data provider and uploaded by the harvester. In the case of a static repository, it requires two harvesting cycles to propagate updates — the first harvesting request following an update fails because the static repository gateway recognizes that it must refresh its cache of records; thus by the next request the gateway is able to respond with the updated records.
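The selection rule can be illustrated with a short Python sketch. It assumes day-granularity datestamps in YYYY-MM-DD form and represents record headers as plain dictionaries; the identifiers and the function name are hypothetical. Note that a record whose datestamp was not advanced when it was edited is simply not selected.

```python
from datetime import date
from typing import Dict, List


def select_for_incremental_harvest(
    records: List[Dict[str, str]], last_successful_harvest: date
) -> List[str]:
    """Return identifiers whose datestamp is on or after the 'from' date."""
    return [
        r["identifier"]
        for r in records
        if date.fromisoformat(r["datestamp"]) >= last_successful_harvest
    ]


records = [
    # Edited recently, but the datestamp was never updated: not selected.
    {"identifier": "oai:example.org:0001", "datestamp": "2011-08-01"},
    {"identifier": "oai:example.org:0002", "datestamp": "2011-08-12"},
    # Datestamp equal to the 'from' date: selected.
    {"identifier": "oai:example.org:0003", "datestamp": "2011-08-10"},
]
print(select_for_incremental_harvest(records, date(2011, 8, 10)))
```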
If more than two days have elapsed since you updated your repository, and the changes you made are not showing when you go to http://www.language-archives.org/archive_records/ to inspect your records, then look at the Participating Archives page to see if your repository is successfully harvesting. There are two possible cases:
If the end of the line for your repository shows a red X, then the last harvest failed. (Click on "More Details"; you will see the date of the last successful harvest near the bottom of the page.) Failure to harvest means that your base URL is not accessible or that your repository has a serious validation problem; enter your base URL into the registration page to generate a validation report.
If the end of the line for your repository shows a light green check mark, then it is harvesting successfully. The reason for failure to update is likely to be the datestamps in the record headers. Remember that the datestamp is the revision date for the metadata record, not for the resource it describes. Thus if the datestamp is not updated to the current date when a record is updated, it will not be picked up by an incremental harvest. The remedy is to update the datestamps. If this is not the cause of the problem, then go to the next question to learn how to force a complete reharvest of your repository.
As a guard against failure to update datestamps, a complete reharvest of every repository is performed approximately monthly. The date of full harvesting for any given repository cannot be predicted exactly since the scheduling algorithm is dynamic — using randomization to spread the load across the month and recalibrating priorities when a particular repository fails to harvest. This reharvest takes care of another problem, namely, the purging of deleted records. In a dynamic repository it is possible to specify status="deleted" on a record and the incremental harvest will delete it from the database. However, static repositories do not support deletion. Therefore, any record that is deleted from a static repository will persist in the OLACA database for up to one month until the monthly reharvest is performed.
When making changes to a repository, you may want to force an immediate reharvest rather than waiting for the next scheduled incremental harvest. It is also possible that the complete reharvest which takes place on a monthly basis could lose records if there are problems on your server that cause the process to time out or otherwise fail to return the records that it should. For these situations there is a service for forcing an immediate complete reharvest. Access the service by going to this page. In your browser, add ?id= followed by the Repository ID of your repository at the end of the current URL, then submit that amended URL to your browser. You will be informed that a confirmation email has been sent to the adminEmail that is identified in your Identify response. If you are that administrator, open the confirmation message and click on the link it contains. Otherwise, alert the repository administrator that they should have gotten such a message and that they need to click the link within it. As soon as the confirmation link is clicked, the service begins the harvesting process and progress is reported on the screen.
If you want to move the base URL for your repository (without changing the repository identifier), you must re-register. Go to the registration page, enter your new base URL, and click the Validate button. If there are no validation errors, you will be presented with a "Change Registration" button. Clicking it will move your repository to the new base URL. If you also want to change the repository identifier, then you must contact the OLAC administrators who will manually remove your existing repository and registration, and then instruct you to register the new repository.
Comments? Further questions? Please click the mailto: link below to give us feedback so that we can make this page more helpful to future implementers.
http://www.language-archives.org/tools/faq.html
Last revised: 13 August 2011