Sunday, March 19, 2006

Building the encyclopedia of life

iSpecies is very limited in the sources it uses, and also in what it extracts from its sources. The sources it does query contain a wealth of information. As an example, GenBank sequence AF131710 from Ligophorus mugilinus has the following information about this animal:

FEATURES             Location/Qualifiers
     source          1..374
                     /organism="Ligophorus mugilinus"
                     /mol_type="genomic DNA"
                     /specific_host="Mugil cephalus"
                     /country="France"

Note the tags "/specific_host" and "/country". By parsing this record we learn that this organism is found in France, and is hosted by Mugil cephalus.
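As a rough illustration of the parsing step, here is a minimal Python sketch that pulls qualifier/value pairs out of a feature table fragment with a regular expression. The sample text is retyped from the excerpt above, with the /country qualifier the post mentions; the names RECORD, QUALIFIER and parse_qualifiers are made up for this example. It assumes each qualifier sits on a single line in the form /name="value", which is not true of every GenBank record (values can wrap across lines, and a record can have many features), so a real harvester would probably want a proper parser such as the one in Biopython.

```python
import re

# Feature table fragment retyped from the excerpt in this post.
RECORD = '''\
FEATURES             Location/Qualifiers
     source          1..374
                     /organism="Ligophorus mugilinus"
                     /mol_type="genomic DNA"
                     /specific_host="Mugil cephalus"
                     /country="France"
'''

# Assumption: one qualifier per line, written as /name="value".
QUALIFIER = re.compile(r'^\s*/(\w+)="([^"]*)"', re.MULTILINE)


def parse_qualifiers(text):
    """Return a dict mapping qualifier names to their values."""
    return {name: value for name, value in QUALIFIER.findall(text)}


facts = parse_qualifiers(RECORD)
print(facts.get("organism"), "| host:", facts.get("specific_host"),
      "| country:", facts.get("country"))
```

The resulting dictionary gives us the host and locality as simple key/value facts that could be attached to the species page.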

In the same way, the Google Scholar results could be used more effectively. In many cases we could follow the links to get abstracts of articles, then use literature data mining techniques (e.g., Hirschman et al.) to extract information on the organism's ecology, etc.
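To give a flavour of what the simplest form of such mining might look like, here is a toy Python sketch that scans abstract text for one fixed phrase pattern ("X is a ... parasite of Y") and emits a host relation. This is not the approach of Hirschman et al., just a pattern-matching stand-in; real text mining systems deal with name recognition, synonyms, and far more varied phrasing. The abstract text below is invented for illustration, and BINOMIAL, PATTERN and extract_host_relations are hypothetical names.

```python
import re

# Invented abstract text, for illustration only.
ABSTRACT = (
    "Ligophorus mugilinus is a gill parasite of Mugil cephalus. "
    "No other hosts are mentioned in this made-up example."
)

# Crude assumption: a species name is a capitalised word followed by a
# lowercase word. This will also match ordinary sentence openings.
BINOMIAL = r"[A-Z][a-z]+ [a-z]+"
PATTERN = re.compile(rf"({BINOMIAL}) is an? (?:\w+ )?parasite of ({BINOMIAL})")


def extract_host_relations(text):
    """Return (parasite, relation, host) triples found in the text."""
    return [(m.group(1), "parasite_of", m.group(2))
            for m in PATTERN.finditer(text)]


relations = extract_host_relations(ABSTRACT)
for parasite, relation, host in relations:
    print(parasite, relation, host)
```

Even a pattern this crude shows the idea: each match becomes a small structured statement about the organism that could be aggregated across many abstracts.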

Extracting this sort of information would be one way to automate the construction of an encyclopedia of life.