Tuesday, September 05, 2006

OpenSearch and iSpecies

I've mentioned OpenSearch in an earlier post, in the context of adding additional sources to iSpecies. But it's slowly dawned on me that what I should be doing is wrapping the sources I currently use in OpenSearch as well. That way, every data source would have a consistent query interface and a consistent return format. If we ensure the latter is RDF, then we get aggregation "for free".

So, I've made a start. First up is Yahoo's image search, which I've wrapped as http://darwin.zoology.gla.ac.uk/cgi-bin/yahoo.cgi. You just append "q=" and the search terms to get a result. Try an example search for images of the ant Atta mexicana. Note that I currently just support the return format, not the query format (that'll come later). The query result is RSS 1.0 because it contains RDF (RSS 2.0 and Atom don't, and hence for my purposes are beside the point). The upshot is that I can now use this search in other projects, and making a better iSpecies becomes simply a case of adding a bunch of OpenSearch sources together.
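For anyone wanting to call the wrapper from a script, here is a minimal Python sketch of building the query URL. The base URL and the "q" parameter come from the description above; the function name build_query_url and the use of "?" as the separator are my assumptions for illustration, and the sketch only builds the URL rather than fetching it.

```python
from urllib.parse import urlencode

# Base URL of the wrapper, as given in the post.
BASE_URL = "http://darwin.zoology.gla.ac.uk/cgi-bin/yahoo.cgi"


def build_query_url(terms, base_url=BASE_URL):
    """Return the wrapper URL with the search terms in the q parameter.

    build_query_url is a hypothetical helper; joining with "?" assumes
    the script reads an ordinary CGI query string.
    """
    return base_url + "?" + urlencode({"q": terms})


if __name__ == "__main__":
    print(build_query_url("Atta mexicana"))
```

urlencode takes care of escaping, so spaces and other awkward characters in a species name don't need special handling by the caller.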

Generating the RSS proved "fun", but the feed now validates as RDF, although Feed Validator grumbles slightly. It's all a bit of a black art, but I had to nest the RDF payload in <content:item> tags, like this:
<foaf:Image rdf:about="http://www.par...x.draw.JPEG">
<dc:description>Leaf-cutter ants (Atta mexicana ) ... </dc:description>
<dc:subject>Atta mexicana</dc:subject>
<foaf:thumbnail rdf:resource="http://mud.mm-a5.yimg.com/image/2050519657"/>
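To show why having RDF in the payload is handy, here is a rough Python sketch of pulling the image records back out of a feed shaped like the fragment above. It is a sketch under assumptions: the sample document is a cut-down stand-in I made up (the example.org URLs are placeholders, not real results), and extract_images is a hypothetical helper, not the code behind the wrapper.

```python
import xml.etree.ElementTree as ET

# Standard namespace URIs for RDF, FOAF and Dublin Core.
NS = {
    "rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#",
    "foaf": "http://xmlns.com/foaf/0.1/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Made-up sample with placeholder URLs, mirroring the fragment's structure.
SAMPLE = """<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <foaf:Image rdf:about="http://example.org/ant.jpg">
    <dc:description>Leaf-cutter ants (Atta mexicana)</dc:description>
    <dc:subject>Atta mexicana</dc:subject>
    <foaf:thumbnail rdf:resource="http://example.org/ant-thumb.jpg"/>
  </foaf:Image>
</rdf:RDF>"""


def extract_images(xml_text):
    """Return one dict per foaf:Image element found anywhere in the document."""
    root = ET.fromstring(xml_text)
    rdf = "{%s}" % NS["rdf"]
    images = []
    # iter() walks the whole tree, so nesting inside other elements is fine.
    for image in root.iter("{%s}Image" % NS["foaf"]):
        thumbnail = image.find("foaf:thumbnail", NS)
        images.append({
            "url": image.get(rdf + "about"),
            "subject": image.findtext("dc:subject", default="", namespaces=NS),
            "description": image.findtext("dc:description", default="", namespaces=NS),
            "thumbnail": thumbnail.get(rdf + "resource") if thumbnail is not None else None,
        })
    return images


if __name__ == "__main__":
    for record in extract_images(SAMPLE):
        print(record["subject"], record["url"], record["thumbnail"])
```

A proper RDF parser would be the better tool for real aggregation; plain XML parsing is used here only to keep the example dependency-free.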

Friday, September 01, 2006

More DOIs

Following on from an earlier post, I've now added DOI extraction for SciELO, which hosts Brazilian publications, and Taylor and Francis. This was motivated by searching iSpecies for the ant Trachymyrmex opulentus, for which only papers hosted by these two publishers appear in the search results.

Again, we are reduced to screen scraping (sigh). Why oh why don't the people who design these web sites get their act together and embed useful information in the HTML, rather than assume that only humans will make use of these pages?
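To give a flavour of what screen scraping for DOIs involves, here is a minimal Python sketch. It is not the extraction code used for SciELO or Taylor and Francis; the pattern is a common heuristic I'm assuming for illustration, the function name scrape_dois is hypothetical, and the sample HTML and DOI are made up.

```python
import re

# Heuristic: "10.", a numeric registrant code, a slash, then a suffix that
# runs until whitespace, a quote or an angle bracket. An assumption for
# illustration, not an official DOI grammar.
DOI_PATTERN = re.compile(r'10\.\d{4,9}/[^\s"<>]+')


def scrape_dois(html):
    """Return the distinct DOI-like strings in html, in order of appearance."""
    found = []
    for match in DOI_PATTERN.findall(html):
        # Drop punctuation that commonly trails a DOI in running text.
        doi = match.rstrip(".,;)")
        if doi not in found:
            found.append(doi)
    return found


if __name__ == "__main__":
    # Made-up page fragment with a placeholder DOI.
    page = ('<p>Full text: <a href="http://dx.doi.org/10.1234/abc.2006.01">'
            'doi:10.1234/abc.2006.01</a>.</p>')
    print(scrape_dois(page))
```

The fragility is the point: the pattern has no idea which DOI on a page belongs to the article itself, which is exactly the guesswork that embedded metadata would remove.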

One provider that is clued up is Ingenta. For example, take a look at the HTML for the article "Influence of Topography on the Distribution of Ground-Dwelling Ants in an Amazonian Forest" (doi:10.1076/snfe. on the Ingenta site (Firefox and Camino users can see the source here). Embedded in the <meta> tags is all sorts of metadata, including the DOI:

<meta name="DC.identifier" scheme="URI"

The use of consistently formatted tags makes data extraction much easier. Of course, it's no surprise that Ingenta do this well (check out their blog).
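By way of contrast with scraping, here is a short Python sketch of reading a DOI from a DC.identifier meta tag. The tag name and scheme attribute come from the snippet above; the content value in the sample (an "info:doi/" URI with a made-up DOI) is my assumption about what such a tag might hold, and DCIdentifierParser and doi_from_meta are hypothetical names.

```python
import re
from html.parser import HTMLParser

# Heuristic DOI pattern; an assumption for illustration.
DOI_PATTERN = re.compile(r"10\.\d{4,9}/\S+")


class DCIdentifierParser(HTMLParser):
    """Collect the content attribute of every <meta name="DC.identifier">."""

    def __init__(self):
        super().__init__()
        self.identifiers = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Compare case-insensitively, since sites vary in capitalisation.
        if (attributes.get("name") or "").lower() == "dc.identifier":
            content = attributes.get("content")
            if content:
                self.identifiers.append(content)


def doi_from_meta(html):
    """Return the first DOI found in a DC.identifier meta tag, or None."""
    parser = DCIdentifierParser()
    parser.feed(html)
    for value in parser.identifiers:
        match = DOI_PATTERN.search(value)
        if match:
            return match.group(0)
    return None


# Made-up page head; the DOI and the "info:doi/" form are placeholders.
SAMPLE_HTML = """<html><head>
<meta name="DC.title" content="A hypothetical article"/>
<meta name="DC.identifier" scheme="URI" content="info:doi/10.1234/example.5678"/>
</head><body></body></html>"""

if __name__ == "__main__":
    print(doi_from_meta(SAMPLE_HTML))
```

Because the tag says what the value is, there is no need to guess which DOI on the page belongs to the article, and the same few lines work for any site that follows the convention.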