Sunday, January 29, 2006

Automatic extraction of references from a paper

One goal for iSpecies is to integrate taxonomic literature into its output. This has been motivated by Donat Agosti's efforts to make the taxonomic literature for ants available (see his letter to Nature about copyright and biopiracy, doi:10.1038/439392a). For example, we can take a paper marked up in an XML schema such as the TaxonX Treatment Markup, extract the treatments of a name, and insert these into a triple store that iSpecies can query. For a crude example, search iSpecies for the "Google ant" Proceratium google.
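
To make the markup-to-triples step concrete, here is a minimal Python sketch. It is an illustration only: the element names (`document`, `treatment`, `name`, `text`), the `id` attribute, and the predicate names are hypothetical placeholders, not the real TaxonX schema or the vocabulary iSpecies uses, and the triples are plain tuples rather than entries in an actual triple store.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: element and attribute names are illustrative
# placeholders, NOT the real TaxonX schema.
SAMPLE = """
<document id="example-paper">
  <treatment>
    <name>Proceratium google</name>
    <text>Placeholder treatment text.</text>
  </treatment>
</document>
"""


def treatments_to_triples(xml_text):
    """Return (subject, predicate, object) tuples, three per treatment."""
    root = ET.fromstring(xml_text.strip())
    source = root.get("id", "unknown")
    triples = []
    for i, treatment in enumerate(root.iter("treatment")):
        # Mint a simple identifier for each treatment within the paper.
        node = f"{source}#treatment{i}"
        name = treatment.findtext("name", default="").strip()
        text = treatment.findtext("text", default="").strip()
        # Predicate names below are made up for this sketch.
        triples.append((node, "treatsName", name))
        triples.append((node, "hasText", text))
        triples.append((node, "publishedIn", source))
    return triples


if __name__ == "__main__":
    for triple in treatments_to_triples(SAMPLE):
        print(triple)
```

A query for a name then amounts to finding every subject with a matching `treatsName` value and following its other predicates.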

Now, marking up documents by hand (which is what Donat does) is tedious in the extreme. How can we automate this? In particular, I'd like to automate extracting taxonomic names and references to other papers. The first can be facilitated by taxonomic name servers, particularly uBio's FindIT SOAP service. Extracting references seems more of a challenge, but tonight I stumbled across ParaCite, which looks like it might do the trick. There is Perl code available from CPAN (although when I tried installing it with cpan on Mac OS X 10.3.9 it failed to build) and from the downloads section of ParaCite. I grabbed Biblio-Citation-Parser-1.10, installed the dependencies via cpan, then built Biblio::Citation::Parser, and so far it looks promising. If the reference strings can be readily pulled out of the marked-up text, then this tool could parse them into bibliographic fields, and we could then look the references up both in taxon-specific databases such as AntBase and in Google Scholar.
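
To give a feel for what citation parsing involves, here is a rough Python sketch that splits one reference string into fields with a single regular expression. This is not ParaCite's method and does not use Biblio::Citation::Parser. It assumes one common layout, "Authors (Year) Title. Journal Volume: Pages", and the function name `parse_citation` and the sample reference are made up for the illustration.

```python
import re

# Assumed layout: "Authors (Year) Title. Journal Volume(Issue): Start-End".
# Real reference lists vary far more than this pattern allows.
CITATION = re.compile(
    r"^(?P<authors>.+?)\s*\(?(?P<year>(?:18|19|20)\d{2})[a-z]?\)?[.,]?\s+"
    r"(?P<title>.+?)\.\s+"
    r"(?P<journal>[^.\d][^\d]*?)\s+"
    r"(?P<volume>\d+)\s*(?:\(\d+\))?\s*:\s*"
    r"(?P<spage>\d+)\s*[-\u2013]\s*(?P<epage>\d+)"
)


def parse_citation(text):
    """Return a dict of bibliographic fields, or None if the layout differs."""
    match = CITATION.match(text.strip())
    return match.groupdict() if match else None


if __name__ == "__main__":
    # Invented reference, used only to exercise the pattern.
    example = ("Smith, J. and Jones, K. (1999) A revision of the example "
               "genus. Journal of Examples 12: 34-56.")
    print(parse_citation(example))
```

The journal, volume, and starting page recovered this way are the sort of fields a lookup against a bibliographic database would need. A single pattern like this breaks on the first unusual reference, which is why a dedicated parser is attractive.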
