Friday, January 20, 2006

Identifiers for publications

Despite my enthusiasm for LSIDs, here are some thoughts on identifiers for publications. Say you want to set up a bibliographic database: how do you generate stable identifiers for its contents?

There's an interesting -- if dated -- review by the National Library of Australia.

The Handle System generates Globally Unique Identifiers (GUIDs), such as hdl:2246/3615 (which can be resolved in Firefox if you have the HDL/DOI extension). Handles can also be resolved with URLs, e.g. http://digitallibrary.amnh.org/dspace/handle/2246/3615 and http://hdl.handle.net/2246/3615. DSpace uses handles.
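As a sketch of how a single handle maps onto these different resolvable forms (the hdl.handle.net proxy is the global resolver; the DSpace-style URL is specific to the AMNH installation mentioned above):

```python
# Sketch: the resolvable forms of one handle. hdl.handle.net is the
# global Handle proxy; the DSpace-style URL is specific to the AMNH
# installation above.
def handle_urls(handle):
    return {
        "hdl": "hdl:" + handle,
        "proxy": "http://hdl.handle.net/" + handle,
        "dspace": "http://digitallibrary.amnh.org/dspace/handle/" + handle,
    }

urls = handle_urls("2246/3615")
# urls["proxy"] → "http://hdl.handle.net/2246/3615"
```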

DOIs deserve serious consideration, despite their cost, especially if the goal is to make literature more widely available. With DOIs, metadata goes into CrossRef, and publishers can use that to add URLs to their electronic publications. That means people reading papers online will have immediate access to the papers in your database. Apart from cost, copyright is an issue (is the material you are serving copyrighted by somebody else?), and recent papers will already have DOIs, so you risk a paper having more than one identifier, which is not ideal.
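One attraction is that a database then only needs to store the bare DOI, because resolution goes through the dx.doi.org proxy. A minimal sketch (the DOI shown is just an illustrative string):

```python
# Sketch: a stored DOI resolves via the dx.doi.org proxy, so the bare
# identifier is all a bibliographic database needs to keep.
# The example DOI is illustrative.
def doi_to_url(doi):
    return "http://dx.doi.org/" + doi

doi_to_url("10.1000/182")  # → "http://dx.doi.org/10.1000/182"
```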

If Handles or DOIs aren't what you want to use, then some sort of persistent URL is an option. The content behind such URLs can be dynamically generated even though they look like static URLs. For background see Using ForceType For Nicer Page URLs - Implementing ForceType sensibly and Making "clean" URLs with Apache and PHP. To do this in Apache you need a .htaccess file in the web folder, e.g.:

# AcceptPathInfo On is for Apache 2.x, don't use for Apache 1.x
<Files uri>
# AcceptPathInfo On
ForceType application/x-httpd-php
</Files>

You need to ensure that .htaccess can override FileInfo, e.g. have this in httpd.conf:

<Directory "/Users/rpage/Sites/iphylo">
Options Indexes MultiViews
AllowOverride FileInfo
Order allow,deny
Allow from all
</Directory>

This would mean that http://localhost/~rpage/iphylo/uri/234 would execute the file uri (which does not have a PHP extension). The file would look something like this:

<?php

// Parse the request URL to extract the identifier, e.g. "234" in
// http://localhost/~rpage/iphylo/uri/234
$uri = $_SERVER["SCRIPT_URL"];
$uri = str_replace ($_SERVER["SCRIPT_NAME"] . '/', '', $uri);

// Check for a format prefix, such as "rdf" or "rss", which flags the
// format to return
$format = 'html';
if (preg_match('/^(rdf|rss)\//', $uri, $matches)) {
    $format = $matches[1];
    $uri = substr($uri, strlen($format) + 1);
}

// Check that it is indeed a valid identifier
if (!preg_match('/^\d+$/', $uri)) {
    header("HTTP/1.0 404 Not Found");
    exit();
}

// Lookup the identifier in our database and display the result
// ...
?>

Lastly, ARK is another option; it is essentially a URL, but one designed to cope with the potential loss of a server. It comes from the California Digital Library. I'm not sure how widely it has been adopted. My sense is that it hasn't been, although the Northwest Digital Archives is looking at it.
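The point about surviving the loss of a server can be sketched as follows: an ARK embeds a Name Assigning Authority Number (NAAN) and a name but no hostname, so the same identifier can be re-attached to any resolver (the ARK and resolver hostnames here are illustrative):

```python
# Sketch: an ARK (ark:/NAAN/Name) carries no hostname, so if the
# original resolver disappears the identifier can be attached to a
# new one. The ARK and resolvers below are illustrative.
def ark_url(resolver, ark):
    return "http://" + resolver + "/" + ark

ark = "ark:/13030/tf5p30086k"
original = ark_url("ark.cdlib.org", ark)
# If that server went away, the same ARK works under another resolver:
fallback = ark_url("example.org", ark)
```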

If cost and hassle are a consideration, I'd go for clean URLs. If you wanted a proper bibliographic archive system I'd consider setting up a DSpace installation. One argument I found interesting in the Australian review is that Handles and DOIs resolve to a URL that may be very different from the identifier, and if people copy the URL in the location bar they won't have copied the GUID, which somewhat defeats the point. In other words, if people are going to store the identifiers, say in a database, they need to capture the identifier itself, not the URL it resolves to.