Friday, September 01, 2006

More DOIs


Following on from an earleir post, I've now added DOI extraction for SciELO, which hosts Brazilian publications, and Taylor and Francis. This was motivated by searching iSpecies for the ant Trachymyrmex opulentus, for which only papers hosted by these two publishers appear in the search results.


Again, we are reduced to screen scraping (sigh). Why oh why don't the people who design these web sites get their act together and embed useful information in the HTML, rather than assume that only humans will make use of these pages?

One provider that is clued up is Ingenta. For example, take a look at the HTML for the article "Influence of Topography on the Distribution of Ground-Dwelling Ants in an Amazonian Forest" (doi:10.1076/snfe.38.2.115.15923) on the Ingenta site (Firefox and Camino users can see the source here). Embedded in the <meta> tags is all sorts of metadata, including the DOI:

<meta name="DC.identifier" scheme="URI"
content="info:doi/10.1076/snfe.38.2.115.15923"/>

The use of consistently formatted tags makes data extraction much easier. Of course, it's no surprise that Ingenta do this well (check out their blog).