Tuesday, June 10, 2008

Offline

iSpecies was offline for a few hours today. I moved it from a folder in my user folder to the /Library/WebServer folder on the web server, and associated ispecies.org with its own IP address (although it is still served from the same machine). Glasgow University's DNS seems to take a while to update, and consequently the site appeared to be broken for a while. A quick external check using Network-Tools.com confirmed that ispecies.org had the new IP address, but locally it was still resolving to the holding page of 123-reg, with whom I registered the domain. By fussing with the VirtualHost directive in the Apache httpd.conf file, I managed to get it working again.
NameVirtualHost 130.209.46.63
<VirtualHost 130.209.46.63>
    # ispecies.org gets its own document root on this IP address
    DocumentRoot "/Library/WebServer/ispecies"
    ServerName ispecies.org
    ServerSignature email
    DirectoryIndex index.php index.html index.htm index.shtml
    LogLevel warn
    HostNameLookups off
    <Directory "/Library/WebServer/ispecies">
        Allow from all
        Options +Indexes
    </Directory>
</VirtualHost>
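For what it's worth, a quick way to see whether your own machine has caught up with the DNS change is a trivial PHP check (just a sketch, nothing iSpecies-specific about it):

<?php
// Sketch: compare what this machine resolves ispecies.org to
// with the address the VirtualHost above is bound to.
$expected = '130.209.46.63';
$resolved = gethostbyname('ispecies.org'); // returns the name unchanged if the lookup fails
echo ($resolved == $expected)
    ? "DNS has caught up: ispecies.org -> $resolved\n"
    : "Still stale: ispecies.org -> $resolved (expected $expected)\n";
?>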

The only difference users may notice is that the URLs will now always start with http://ispecies.org.

Tuesday, March 25, 2008

Wikipedia on iSpecies


I've added snippets from Wikipedia to iSpecies results, in part inspired by FreeBase. This makes use of the XML export format. For example, the URL http://en.wikipedia.org/wiki/Special:Export/Luzon_Montane_Forest_Mouse returns XML, with the wiki markup enclosed in the tags <text xml:space="preserve"></text>. I use some simple regular expressions to strip some of the markup out, including the taxobox, then I grab the first 100 words of the article to display on the iSpecies page (together with a link to the original article).
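This isn't the actual iSpecies code, but a rough PHP sketch of the idea (the regular expressions are simplified stand-ins for the real clean-up, and file_get_contents needs allow_url_fopen enabled):

<?php
// Sketch: fetch the Special:Export XML for a page and pull out a short snippet.
$title = 'Luzon_Montane_Forest_Mouse';
$xml = file_get_contents("http://en.wikipedia.org/wiki/Special:Export/$title");

// Grab whatever sits inside <text xml:space="preserve">...</text>
if (preg_match('/<text xml:space="preserve"[^>]*>(.*?)<\/text>/s', $xml, $m)) {
    $wiki = $m[1];
    // Very crude clean-up: drop templates such as the taxobox, unwrap links, strip bold/italic quotes
    $wiki = preg_replace('/\{\{.*?\}\}/s', '', $wiki);                   // {{Taxobox ...}} etc.
    $wiki = preg_replace('/\[\[([^|\]]*\|)?([^\]]*)\]\]/', '$2', $wiki); // [[A|B]] -> B, [[A]] -> A
    $wiki = preg_replace("/'{2,}/", '', $wiki);                          // ''italic'', '''bold'''
    // Take the first 100 words for display
    $words = preg_split('/\s+/', trim($wiki));
    echo implode(' ', array_slice($words, 0, 100)), "...\n";
}
?>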

Because a species may have multiple names, we need to handle redirection. For example, the URL http://en.wikipedia.org/wiki/Special:Export/Apomys_datae returns
<text xml:space="preserve">#Redirect [[Luzon Montane Forest Mouse]]</text>

which tells us that the content is to be found at http://en.wikipedia.org/wiki/Special:Export/Luzon_Montane_Forest_Mouse.
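Again, just a sketch of how the redirect can be followed before applying the clean-up above (fetch_wiki_text is a hypothetical helper, not existing code):

<?php
// Sketch: if the exported markup is just a #Redirect, fetch the target page instead.
function fetch_wiki_text($title) {
    $xml = file_get_contents('http://en.wikipedia.org/wiki/Special:Export/' . rawurlencode($title));
    if (!preg_match('/<text xml:space="preserve"[^>]*>(.*?)<\/text>/s', $xml, $m)) {
        return '';
    }
    $wiki = $m[1];
    // #Redirect [[Luzon Montane Forest Mouse]] -> follow it
    if (preg_match('/^#Redirect\s*\[\[([^\]]+)\]\]/i', trim($wiki), $r)) {
        return fetch_wiki_text(str_replace(' ', '_', $r[1]));
    }
    return $wiki;
}

$text = fetch_wiki_text('Apomys_datae'); // ends up with the Luzon Montane Forest Mouse article
?>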

There's still some polishing to do, but the Wikipedia snippets add something to the iSpecies results.

Thursday, August 30, 2007

Maps, and a Google tweak


Today I stumbled across the Species Distribution Widget from GBIF (written by Tim Robertson and Dave Martin). For Mac OS X 10.4 users, this provides a cool way to quickly get a distribution map for a taxon. Given that Apple dashboard widgets are essentially Javascript and HTML, it occurred to me to reverse engineer the widget to see what it did. To open the widget you just "Ctrl-click" on the widget icon, select Show Package Contents, and the contents open in a Finder window.



The guts of the widget are in the scripts folder, which contains a Javascript file. The widget calls the URL http://data.gbif.org/species/taxonName/ajax/returnType/concept/view/ajaxMapUrls/provider/1/?query=, to which is appended the taxon name you are searching for. Back comes the result in XML. For example, searching for Apus apus returns:
<taxons>
<taxon>
<name>Apus apus</name>
<commonName>Common swift</commonName>
<key>13836131</key>
<url>species/13836131/overviewMap.png</url>
</taxon>
</taxons>

(Shouldn't "taxons" be "taxa"?) The URL of the corresponding map is given in the <url> tag. Append this to "http://data.gbif.org/", and you have the URL for the image of the map. For example, here's the map for Apus apus.
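Here's a rough PHP sketch of the round trip, not the code actually in iSpecies (it assumes SimpleXML is available):

<?php
// Sketch: ask the GBIF service for the map of a taxon and build the image URL.
$name = 'Apus apus';
$url = 'http://data.gbif.org/species/taxonName/ajax/returnType/concept/view/ajaxMapUrls/provider/1/?query='
     . urlencode($name);

$xml = simplexml_load_string(file_get_contents($url));
if ($xml && isset($xml->taxon[0])) {
    $taxon  = $xml->taxon[0];
    $mapUrl = 'http://data.gbif.org/' . (string) $taxon->url; // e.g. .../species/13836131/overviewMap.png
    echo $taxon->name, ' (', $taxon->commonName, '): ', $mapUrl, "\n";
}
?>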


I've added code to do this to iSpecies, so it now features maps from GBIF. I've also finally tweaked the Google code to stop mangling UTF-8 characters.

Monday, March 05, 2007

5 Ways to Mix, Rip, and Mash Your Data

Spotted by Simon Rycroft, Nick Gonzalez has a comparison of mashup scripts entitled 5 Ways to Mix, Rip, and Mash Your Data.
Call them pipes, teqlos, dapps, modules, mashups or whatever else but fact is that recently we have seen a good number of new services that allow developers and users to build mini-apps and mashups that mix and re-mix data. Here we run through 5 applications that allow you to mix, rip and mash your data, looking at the data input, output, REST support, suggested use, and required skill level.


Clearly, this stuff is attracting a lot of attention.

Saturday, March 03, 2007

Wikis and the future of iSpecies

So, where next for iSpecies? An obvious route seems to be adding a Wiki, something I've discussed on SemAnt. Imagine pre-populating a Wiki with search results from iSpecies, especially if we drilled down using the links in the NCBI search results to extract further content, and made use of the improved mapping between NCBI and TreeBASE names (TBMap).

A few things have stopped me from implementing this. One is the problem that wikis are (usually) just unstructured text. However, semantic wikis are starting to emerge (e.g., Semantic MediaWiki and Rhizome; I'll be adding links to more at del.icio.us/rdmpage/semantic-wiki). Using a semantic wiki means we can enter structured information and render it as RDF, which would make it potentially a great way to capture basic facts (triples) about a taxon, but still have human-readable and editable documents.
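Just to make that concrete, the kind of "basic facts" I have in mind might render as something like the following (the http://example.org/terms# namespace and its properties are made up purely for illustration, not an existing vocabulary):

<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
         xmlns:dc="http://purl.org/dc/elements/1.1/"
         xmlns:ex="http://example.org/terms#">
 <rdf:Description rdf:about="http://example.org/wiki/Physeter_catodon">
  <dc:title>Physeter catodon</dc:title>
  <ex:rank>species</ex:rank>
  <ex:parentTaxon rdf:resource="http://example.org/wiki/Physeter"/>
  <ex:hasSynonym>Physeter macrocephalus</ex:hasSynonym>
 </rdf:Description>
</rdf:RDF>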

I've been pondering this, and toying with either writing something myself or using an off-the-shelf solution. It's likely that I'll write something, because I want to link it to a triple store, and I want to pre-populate the wiki as much as possible.

One minor thing that has been holding me back is thinking about URLs to link to the content. For example, I'd like to be able to do the following:
  • Link to a page by either a unique numerical identifier (e.g., "wiki/0342001") or a name (e.g., "wiki/Physeter catodon"). If the user enters the numerical version, they get redirected to the text identifier.

  • If a name is a synonym, redirect the user to the page for the accepted name. For example, "wiki/Physeter macrocephalus" would redirect to "wiki/Physeter catodon".

  • If the name is a homonym, display a disambiguation page listing the different taxa with that name.

  • If a user creates a URL that doesn't exist, the wiki would offer to make a new page, after checking that the URL tag is a scientific name (say by using uBio's XML web service).


I've been learning about the joys of Apache's mod_rewrite, which looks like a nice way to deal with some of these issues. For example, this .htaccess file handles both numerical and text identifiers.

# Rewriting must be switched on for these rules to apply
RewriteEngine On
# Don't mess with the actual script call
RewriteRule ^get\.php - [L]
# URL is a numerical id
RewriteRule ^([0-9]+)$ get.php?id=$1 [L]
# URL is a tag name
RewriteRule ^([A-Za-z].*)$ get.php?name=$1 [L]

Then, the code in get.php would display the appropriate page. If the parameter is a numerical id, it's a simple database lookup (numerical identifiers are great because databases handle them easily, and they can be stored without worrying about issues such as capitalisation and punctuation). If it's a name, we follow the steps outlined above to handle synonyms, etc.
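A bare-bones sketch of what such a get.php might look like; lookup_by_id, lookup_by_name, show_page, show_disambiguation and offer_new_page are hypothetical placeholders, not code that exists yet:

<?php
// Sketch of the dispatcher behind the rewrite rules above.
if (isset($_GET['id'])) {
    // Numerical identifier: a straight database lookup
    $page = lookup_by_id((int) $_GET['id']);
    show_page($page);
} elseif (isset($_GET['name'])) {
    $matches = lookup_by_name($_GET['name']);   // may return 0, 1 or many taxa
    if (count($matches) == 1) {
        $taxon = $matches[0];
        if ($taxon['is_synonym']) {
            // Synonym: bounce to the accepted name, e.g. Physeter macrocephalus -> Physeter catodon
            header('Location: /wiki/' . $taxon['accepted_name']);
        } else {
            show_page($taxon);
        }
    } elseif (count($matches) > 1) {
        show_disambiguation($matches);          // homonyms: list the candidate taxa
    } else {
        offer_new_page($_GET['name']);          // unknown: check it's a scientific name, then offer to create
    }
}
?>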

The point of this is that we get clean URLs, but users can still link using natural URLs like those in Wikipedia and Wikispecies. Given this, why don't I use Wikispecies? Well, because it's not a semantic wiki, so I don't gain anything from locking information up in this format.

Thursday, February 22, 2007

RSSBus


David Shorthouse alerted me to RSSBus, which is similar to Yahoo's Pipes, but Lance Robinson (the "Tech Evangelist" at RSSBus) argues that their product is much better. What is RSSBus?
RSSBus is a Really Simple Service Bus that uses the RSS protocol as the main interchange mechanism. RSS is an extensible protocol used to exchange Feeds of Items. Normally these are news items or blog postings, but they don't have to be: RSS Feeds may be augmented through standard RSS extensions to exchange any type of data.

RSSBus is a collection of tools and services that simplify the process of creating RSS Feeds with rich data extensions. Feeds are generated from RSSBus Connectors, reusable code modules that convert data into feeds. They do so by communicating with RSSBus over defined interfaces (please refer to our RSSBus Connectors Reference for details on building custom connectors).

RSSBus provides an infrastructure for generating, maintaining, combining, manipulating, and visualizing Feeds. Items and Feeds are orchestrated by the RSSBus Engine and together help create a loosely integrated application architecture which we like to refer to as RSS Web.

David says he has managed to recreate iSpecies on his desktop with RSSBus, which sounds cool. So far RSSBus is a Windows-only tool, although there is code for other platforms listed on the blog. There is also a white paper.
Looks like the conversation on OpenSearch, RSS, and biodiversity informatics has only just got started.

Saturday, December 02, 2006

Open Search and The Nearctic Spider Database - almost there

As announced on TAXACOM, David Shorthouse has added an Open Search interface to his really nice Nearctic Spider Database. As I've noted previously (see Adding sources to iSpecies and OpenSearch and iSpecies), OpenSearch seems an obvious candidate for a simple way to add search functionality to biodiversity web sites.

The interface is generated by some software called Zoom Search, and the interface is here. As an example, here is a query for the spider Enoplognatha latimana.

But...

Having an easy way to search a site using a URL API such as Open Search is great, but the feed is RSS 2.0, and as a result has very little information. For example, here's an extract:


<item>
 <title>The Nearctic Spider Database: Enoplognatha latimana Hippa & Oksala, 1982 Description</title>
 <link>http://canadianarachnology.dyndns.org/data/spiders/7561</link>
 <description>THERIDIIDAE: Enoplognatha latimana taxonomic and natural history description in the Nearctic Spider Database.</description>
 <zoom:context> ... Descriptions Home Search: Register Log in Enoplognatha latimana Hippa& Oksala, 1982 Temporary ... 2007 Arachnid Calendar FAMILY: THERIDIIDAE Sundevall, 1833 Genus: Enoplognatha Pavesi, 1880 ...</zoom:context>
 <zoom:termsMatched>2</zoom:termsMatched>
 <zoom:score>1804</zoom:score>
 </item>


This information is intended to be displayed in a feed reader, and hence viewed by a human. But what if I want to put this information in a database, or combine it with other data sources in a mashup such as iSpecies? Well, I have to scrape the information out of free-form text. In other words, I'm no further forward than if I scraped the original web page.

If we want to make the information accessible to a computer, then we need something else. RDF is the obvious way forward.

The difference that RDF makes

To illustrate the difference, let's search for images of the same spider (Enoplognatha latimana) using my Open Search wrapper for Yahoo's image search (described in OpenSearch and iSpecies). Here is the query. This feed is formatted as RSS 1.0, and I can view it in a feed reader such as NetNewsWire.



But, because the feed is RSS 1.0 and therefore RDF, the feed contains lots of information on the image in a form that can be easily consumed.


<foaf:Image rdf:about="http://www.spiderling.de/arages/Fotogalerie/Enoplognatha_latimana_1024.jpg">
 <dc:type>image</dc:type>
 <dc:title>Enoplognatha_latimana_1024.jpg</dc:title>
 <dc:description></dc:description>
 <dc:subject>Enoplognatha latimana</dc:subject>
 <dc:source>http://www.spiderling.de/arages/Verbreitungskarten/ENO_LAT0.HTM</dc:source>
 <dc:format>image/jpeg</dc:format>
 <foaf:thumbnail rdf:resource="http://re3.mm-a1.yimg.com/image/206564554"/>
</foaf:Image>
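To make "easily consumed" concrete, here is a minimal PHP sketch (not iSpecies code) that pulls the Dublin Core and FOAF properties out of a feed like this using namespace-aware SimpleXML. It assumes the foaf:Image elements sit directly under the feed's root and that the rdf, dc and foaf prefixes are declared as in the extract above, which may not match the wrapper's actual layout:

<?php
// Sketch: extract image metadata from an RSS 1.0 (RDF) feed item like the one above.
$feedUrl = '...'; // the Open Search query URL linked above
$rdf = simplexml_load_string(file_get_contents($feedUrl));

$ns = $rdf->getNamespaces(true);
foreach ($rdf->children($ns['foaf'])->Image as $image) {
    $dc = $image->children($ns['dc']);
    echo 'Subject:   ', $dc->subject, "\n";
    echo 'Image:     ', $image->attributes($ns['rdf'])->about, "\n";
    echo 'Source:    ', $dc->source, "\n";
    echo 'Thumbnail: ', $image->children($ns['foaf'])->thumbnail->attributes($ns['rdf'])->resource, "\n";
}
?>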


In this feed I use the FOAF and Dublin Core vocabularies. These are widely used, making it easy to integrate this information into a larger database, such as a triple store. To my mind, this is the way forward. We need to move beyond thinking about making data accessible only to people, and make it accessible to computers as well. Once we do this, we can start to aggregate and query the huge amounts of data on the web (as exemplified by David's wonderful site on spiders). And once we do that, we may discover all sorts of things that we don't know (see Disconnected databases, and Discovering new things).