Sunday, March 19, 2006

Building the encyclopedia of life

iSpecies is very limited in the sources it uses, and also in what it extracts from its sources. The sources it does query contain a wealth of information. As an example, GenBank sequence AF131710 from Ligophorus mugilinus has the following information about this animal:

FEATURES             Location/Qualifiers
     source          1..374
                     /organism="Ligophorus mugilinus"
                     /mol_type="genomic DNA"
                     /specific_host="Mugil cephalus"
                     /country="France"

Note the tags "/specific_host" and "/country". By parsing this record we learn that this organism is found in France, and is hosted by Mugil cephalus.
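As a rough illustration of how little work this parsing takes, here is a minimal Python sketch that pulls the qualifiers out of a source feature. It is not how iSpecies does anything at present. The function name parse_qualifiers and the regular expression are my own, the /country line is written out from the description above, and the sketch assumes each qualifier sits on a single line (real GenBank records can wrap long values across lines, which this does not handle).

```python
import re

# Excerpt of the source feature for AF131710. The /country value is
# taken from the description in the text (assumption for this sketch).
RECORD = '''\
FEATURES             Location/Qualifiers
     source          1..374
                     /organism="Ligophorus mugilinus"
                     /mol_type="genomic DNA"
                     /specific_host="Mugil cephalus"
                     /country="France"
'''

# Matches single-line qualifiers of the form /key="value".
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"\s*$')


def parse_qualifiers(text):
    """Return a dict mapping qualifier name to value."""
    qualifiers = {}
    for line in text.splitlines():
        match = QUALIFIER.match(line)
        if match:
            qualifiers[match.group(1)] = match.group(2)
    return qualifiers


info = parse_qualifiers(RECORD)
print(info.get("specific_host"))
print(info.get("country"))
```

A dictionary keeps the sketch short. A real harvester would probably also want to record which sequence each fact came from.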

In the same way, the Google Scholar results could be used more effectively. In many cases we could follow the links to get abstracts of articles, then use literature data mining techniques (e.g., Hirschman et al.) to extract information on the organism's ecology, etc.

Extracting this sort of information would be one way to automate the construction of an encyclopedia of life.

Towards a faster iSpecies: building libxml and libxslt on Mac OS X

iSpecies is written in PHP, and calls a Perl CGI script (to query Google Scholar). This works, but is a bit slow. It also puts limits on what we can do. For example, it would be cool to make the search multithreaded so that the different sources are queried at the same time. This becomes a major issue if we want to "drill down." For example, if a taxon exists in NCBI, it would be useful to visit all the LinkOut resources and collect whatever information they make available. Likewise, Google Scholar results contain links to publishers that could be explored further (such as extracting bibliographic information from RIS files, or RSS feeds such as those available for Ingenta-hosted journals). All of this would delay displaying search results to the user, especially if we have to visit one link after another.
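To make the idea of querying the sources at the same time concrete, here is a small Python sketch using a thread pool. It is only an illustration of the pattern, not iSpecies code: the query functions are stand-ins that sleep instead of making a real HTTP request, and the names make_source, SOURCES and search_all are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def make_source(name, delay):
    """Build a stand-in for a remote query (a real one would do an HTTP request)."""
    def query(taxon):
        time.sleep(delay)  # simulate network latency
        return f"{name} results for {taxon}"
    return query


# Hypothetical sources; the delays are illustrative only.
SOURCES = {
    "NCBI": make_source("NCBI", 0.2),
    "Google Scholar": make_source("Google Scholar", 0.2),
}


def search_all(taxon):
    """Query every source concurrently and return {source name: result}."""
    with ThreadPoolExecutor(max_workers=len(SOURCES)) as pool:
        # Submit all queries first so they run at the same time...
        futures = {name: pool.submit(query, taxon)
                   for name, query in SOURCES.items()}
        # ...then wait for each result.
        return {name: future.result() for name, future in futures.items()}


start = time.perf_counter()
results = search_all("Ligophorus mugilinus")
elapsed = time.perf_counter() - start
for name, result in results.items():
    print(name, "->", result)
print(f"elapsed: {elapsed:.2f}s")
```

Because the queries overlap, the total wait should be close to that of the slowest source rather than the sum of all of them, which is the whole point when each "drill down" adds more links to visit.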

Multithreading would help, but PHP doesn't do this, hence I'm toying with moving to C++ and building a "proper" application (I don't do Java). This means I need to get XML, XPath, and XSLT libraries for C/C++, and this has been, ahem, interesting. I was going to use Sablotron (which I use in my PHP 4 and Perl work), but its documentation is just awful (where are some nice examples?). I will probably use libxml and libxslt. These come with Mac OS X 10.3.9 (I do my development on a G4 iBook, before moving stuff to a Linux box), but Apple hasn't compiled libxml with XPath support (sigh). I built libxml 2.6.23 OK, but libxslt 1.1.15 needed a little hand-holding because of the presence of Apple's libxml. The following does the trick:

./configure --with-libxml-prefix=/usr/local

This tells configure to use the version of libxml I installed in /usr/local. Now, once I get my head around libcurl I'll try and build something and see if we can speed up iSpecies.