Saturday, December 02, 2006

Open Search and The Nearctic Spider Database - almost there

As announced on TAXACOM, David Shorthouse has added an Open Search interface to his really nice Nearctic Spider Database. As I've noted previously (see Adding sources to iSpecies and OpenSearch and iSpecies ), OpenSearch seems an obvious candidate for a simple way to add search functionality to biodiversity web sites.

The interface is generated by a program called Zoom Search, and is available here. As an example, here is a query for the spider Enoplognatha latimana.

But...

Having an easy way to search a site using a URL API such as Open Search is great, but the feed is RSS 2.0, and as a result has very little information. For example, here's an extract:


<item>
 <title>The Nearctic Spider Database: Enoplognatha latimana Hippa &amp; Oksala, 1982 Description</title>
 <link>http://canadianarachnology.dyndns.org/data/spiders/7561</link>
 <description>THERIDIIDAE: Enoplognatha latimana taxonomic and natural history description in the Nearctic Spider Database.</description>
 <zoom:context> ... Descriptions Home Search: Register Log in Enoplognatha latimana Hippa&amp; Oksala, 1982 Temporary ... 2007 Arachnid Calendar FAMILY: THERIDIIDAE Sundevall, 1833 Genus: Enoplognatha Pavesi, 1880 ...</zoom:context>
 <zoom:termsMatched>2</zoom:termsMatched>
 <zoom:score>1804</zoom:score>
 </item>


This information is intended to be displayed in a feed reader, and hence viewed by a human. But what if I want to put this information in a database, or combine it with other data sources in a mashup, such as iSpecies? Well, I have to scrape information out of free-form text. In other words, I'm no further forward than if I scraped the original web page.
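A minimal Python sketch of the problem, using only the standard library. The `zoom` namespace URI below is a placeholder (the extract does not show the feed's namespace declarations), and the regular expression is only a guess at how the title is laid out:

```python
import re
import xml.etree.ElementTree as ET

# Placeholder namespace URI: the extract does not show the real declaration.
ZOOM_NS = "http://example.org/zoom-placeholder"

ITEM_XML = f"""<item xmlns:zoom="{ZOOM_NS}">
 <title>The Nearctic Spider Database: Enoplognatha latimana Hippa &amp; Oksala, 1982 Description</title>
 <link>http://canadianarachnology.dyndns.org/data/spiders/7561</link>
 <description>THERIDIIDAE: Enoplognatha latimana taxonomic and natural history description in the Nearctic Spider Database.</description>
 <zoom:termsMatched>2</zoom:termsMatched>
 <zoom:score>1804</zoom:score>
</item>"""

item = ET.fromstring(ITEM_XML)
title = item.findtext("title")
link = item.findtext("link")
score = int(item.findtext(f"{{{ZOOM_NS}}}score"))

# The XML parses cleanly, but the taxon name, authority and year are buried
# in free text, so recovering them means pattern-matching on the title.
match = re.search(r": ([A-Z][a-z]+ [a-z]+) (.+?), (\d{4}) Description$", title)
genus_species, authority, year = match.groups()

print(genus_species)
print(authority)
print(year)
```

The search score comes out as a typed value with no effort, but the name, authority and year only come out because the pattern happens to fit this one title; a differently worded title would break it.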

If we want to make the information accessible to a computer, then we need something else. RDF is the obvious way forward.

The difference that RDF makes

To illustrate the difference, let's search for images of the same spider (Enoplognatha latimana) using my Open Search wrapper for Yahoo's images search (described in OpenSearch and iSpecies). Here is the query. This feed is formatted as RSS 1.0, and I can view it in a feed reader, such as NetNewsWire.



But, because the feed is RSS 1.0 and therefore RDF, the feed contains lots of information on the image in a form that can be easily consumed.


<foaf:Image rdf:about="http://www.spiderling.de/arages/Fotogalerie/Enoplognatha_latimana_1024.jpg">
 <dc:type>image</dc:type>
 <dc:title>Enoplognatha_latimana_1024.jpg</dc:title>
 <dc:description></dc:description>
 <dc:subject>Enoplognatha latimana</dc:subject>
 <dc:source>http://www.spiderling.de/arages/Verbreitungskarten/ENO_LAT0.HTM</dc:source>
 <dc:format>image/jpeg</dc:format>
 <foaf:thumbnail rdf:resource="http://re3.mm-a1.yimg.com/image/206564554"/>
</foaf:Image>


In this example, I use the FOAF and Dublin Core vocabularies. These are widely used, making it easy to integrate this information into a larger database, such as a triple store. To my mind, this is the way forward. We need to move beyond making data accessible only to people, and start making it accessible to computers. Once we do this, then we can start to aggregate and query the huge amounts of data on the web (as exemplified by David's wonderful site on spiders). And once we do that, we may discover all sorts of things that we don't know (see Disconnected databases, and Discovering new things).
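As a rough illustration of why this form is easier to consume, the following Python sketch turns the `foaf:Image` description above into subject/predicate/object triples using only the standard library. It handles just this flat pattern and is not a general RDF/XML parser (a real application would more likely use an RDF library). The function name `image_triples` is invented for the example, and the enclosing `rdf:RDF` element is assumed, since the extract shows only the image node.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

IMAGE_URI = "http://www.spiderling.de/arages/Fotogalerie/Enoplognatha_latimana_1024.jpg"

# The rdf:RDF wrapper is assumed; the extract shows only the foaf:Image node.
RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:dc="{DC}">
<foaf:Image rdf:about="{IMAGE_URI}">
 <dc:type>image</dc:type>
 <dc:title>Enoplognatha_latimana_1024.jpg</dc:title>
 <dc:description></dc:description>
 <dc:subject>Enoplognatha latimana</dc:subject>
 <dc:source>http://www.spiderling.de/arages/Verbreitungskarten/ENO_LAT0.HTM</dc:source>
 <dc:format>image/jpeg</dc:format>
 <foaf:thumbnail rdf:resource="http://re3.mm-a1.yimg.com/image/206564554"/>
</foaf:Image>
</rdf:RDF>"""


def image_triples(rdf_xml):
    """Return (subject, predicate, object) triples for each foaf:Image node.

    Handles only flat descriptions like the one above.
    """
    triples = []
    root = ET.fromstring(rdf_xml)
    for node in root.findall(f"{{{FOAF}}}Image"):
        subject = node.get(f"{{{RDF}}}about")
        triples.append((subject, RDF + "type", FOAF + "Image"))
        for prop in node:
            # ElementTree tags look like "{namespace}local"; join them into a URI.
            namespace, local = prop.tag[1:].split("}")
            predicate = namespace + local
            # A property is either a link to another resource or a literal.
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "")
            triples.append((subject, predicate, obj))
    return triples


triples = image_triples(RDF_XML)

# Every statement is keyed by full URIs, so "what taxon does this image show?"
# is a lookup rather than a text-scraping exercise.
subjects = [o for (s, p, o) in triples if p == DC + "subject"]
print(subjects)
```

Because the predicates are globally agreed URIs, triples produced this way from two different feeds can simply be pooled and queried together, which is the property the free-text RSS 2.0 item lacks.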

22 comments:

Anonymous said...

Rod,

Thanks for putting thought into this. I wrote a few posts to the Zoom Search forum to solicit comments by the developers: http://www.wrensoft.com/forum/showthread.php?p=5101#post5101. I have no doubt that your very excellent and important suggestions can be implemented. The folks at Zoom Search are quite accommodating and are on top of their game & likely see a huge market here.

Dave

Anonymous said...

A few points regarding your comments about the Zoom Search Engine XML output.

1) The Zoom output format was selected to be compatible with the main existing Opensearch aggregator, A9.
From my reading it seems that A9.com supports both RSS 2.0 and Atom 1.0. (Not RSS1.0 / RDF as you suggest as a preferred format)

2) In the example you gave, there really doesn't appear to be a great deal of additional information in your RSS1.0 example. There is a title, subject, link. The additional thumbnail and source information only applies to image searches. But you didn't search for an image in Zoom. Plus Zoom returned additional information. Terms matched, score & page context. So the claim of very little information being returned is incorrect. For the purposes of providing search results, Zoom produces much more useful information.

3) It isn't only a protocol format issue. Zoom can only return the information stored in its search index. For example, Zoom doesn't store a 'subject', but does store a 'description'.

4) I don't see any reason why RSS2.0 would be harder to 'consume' than RSS1.0. In fact I would suggest the simpler RSS2.0 format is easier to deal with.

5) The XML output was not designed to be directly read by humans. Zoom has HTML output for that. It was designed to be easy to parse (not scrape) by scripting languages and aggregators.

Roderic Page said...

Let me respond by saying that it's not my intention to criticise Zoom. It's a cool product. My point was that -- if the ultimate goal is data integration -- it's not enough to provide an easy means to search, or even to return results in a standard format. By way of background, my goal in this area is to advocate using tools from the Semantic Web community. For background see my Semant blog.

Now, point by point:

1) Yes, A9 doesn't support RSS 1.0, which in my opinion is an unfortunate choice on their part.

2) I disagree. The extra information you refer to relates to the search (e.g., score) which, while useful, isn't what I'm after. What my small example gave was a description of an image using an established vocabulary (FOAF) so that others with images described in the same way can aggregate that information. If I wanted, I could extend it further by extracting metadata from the image itself. Why is this useful? Well again it avoids having a human do this (see Copyright on images for more on this).

4) Depends what the goal is. My goal is to aggregate information from diverse sources and query it. To do this I need several things, such as consistent identifiers (Globally Unique Identifiers or GUIDs), consistent vocabularies (for example, for basic metadata something like Dublin Core, for people FOAF, for publications PRISM, etc.), and tools for storing and querying this information (such as triple stores and languages such as SPARQL). In other words, the Semantic Web.

One way to think about this is to ask "who is going to make use of your feed?" If the answer is "people can view it in a feed reader, or add it as a source to A9 and look at the results" then RSS 2.0/Atom is fine. But if you want computers to consume the feed and be able to merge it with other feeds and make inferences, then I suggest we need RSS 1.0 and RDF (at the very least, this makes things a lot easier).

5) Yes, XML is not designed to be read directly by humans, but that doesn't necessarily make it easier for computers to handle it. For example, if two different XML sources use different vocabularies to describe an image, how are we supposed to merge those two documents?

I hope this clarifies why I made the comments I did. Zoom provides a nice tool for searching a web site and providing results in a standard, accessible form. It's just that I want more than that, and by adding just a little more (i.e., RDF), the potential payoff becomes so much greater. Now, the Semantic Web may well be outside the area that you guys want to get into, and given the chasm between the hype and the current reality, that may be sensible. But, I think there's great potential there. Imagine a product that makes it easy for users to aggregate search results from different sites. Not just displaying them like A9, but integrating them. Some people seem to think there's money in this (see Tales of a Semantic Web Consultancy).

Anonymous said...

In my opinion the Semantic Web is just an academic pipe dream. And not a practical proposition at this time.

The majority of information on the web is unstructured. HTML pages, PDF files & Word documents without any useful meta data most of the time.

At the present time software is just not sophisticated enough to take arbitrary text and interpret the meaning. Current software would be lucky to even correctly extract the author's name from a Word document, let alone get any deeper meaning and structure.

So in the absence of software that can extract meaning from text, creating highly structured output is only possible with structured input. Which we don't have.

I still remember being told 20 years ago that (EDI / X12) would replace all other methods of exchanging data. It has since died a slow death except in very narrow market segments. I have no reason to think that RDF will do any better.

We (the Zoom SE developers) build what people ask for. Lots of people ask for XML. Quite a few ask for Opensearch compatibility. Until now no one has been asking for RSS 1.0.

Quote:
"Imagine a product that makes it easy for users to aggregate search results from different sites."

Yes, we did. We built and released the product a couple of weeks ago. You can download a demo of Zoom MasterNode here.
http://www.wrensoft.com/forum/showthread.php?t=1212

Roderic Page said...

Thanks for the comments. I realise that you have to make a living from Zoom, which means you probably have a much keener sense than I of what people actually want right now.

Perhaps the Semantic Web writ large is a pipe dream, but I struggle to see a better way of approaching the task of integrating diverse sources of structured and semi-structured information. Your target audience is much larger than mine, and most information on the web is unstructured. What I've been arguing in various blogs and lists is that in biodiversity informatics we have lots of structured or semi-structured data, and hence the Semantic Web approach is feasible.

I also think indexing that makes use of metadata will become more feasible as time goes on. Metadata tags are now automatically embedded in images taken with digital cameras, blog news feeds have metadata, including geotagging, and documents in XML format with embedded metadata (such as OpenDocument) will become more common. I think it's a case of setting realistic goals, which I agree the Semantic Web community isn't always good at doing. But I suspect the more limited goal of metadata driven searching is a lot closer than some might think.

Anonymous said...

I would prefer to see something like LSIDs instead of microformats. The latter look like a quick & useful system for web developers (but see unanswered potential problem here) and I would worry that in time, their usefulness would degrade to full text, i.e. screen-scraping once again. What Rod is proposing really wouldn't be difficult to do, it's just a matter of demonstrating a market...or making one.

Anonymous said...

"But I suspect the more limited goal of metadata driven searching is a lot closer than some might think."

Indeed it is. Zoom already scrapes image metadata with an image plug-in to construct its index.

Anonymous said...

OK, maybe we're on to something here. I can very easily put LSIDs as a "species" microformat in the <h1> tag for the page and also in the META description. Consequently, Zoom will produce its RSS 2.0/xml and in the description element will be something like:

<description>THERIDIIDAE: Wamba crispulus taxonomic and natural history description in the Nearctic Spider Database. urn:lsid:ubio.org:namebank:3552058</description>

So, an OpenSearch aggregator could strip all text out of this description element with the exception of everything to the right of and including "urn:lsid:"
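A minimal Python sketch of that stripping step, assuming the description contains a single LSID with no internal whitespace and no trailing punctuation attached to it. The function name `extract_lsid` is hypothetical:

```python
import re

# An LSID has the form urn:lsid:authority:namespace:object[:revision].
LSID_PATTERN = re.compile(
    r"urn:lsid:[^\s:]+:[^\s:]+:[^\s:\]]+(?::[^\s:\]]+)?",
    re.IGNORECASE,
)


def extract_lsid(description):
    """Return the first LSID found in free text, or None if there is none."""
    match = LSID_PATTERN.search(description)
    return match.group(0) if match else None


description = (
    "THERIDIIDAE: Wamba crispulus taxonomic and natural history description "
    "in the Nearctic Spider Database. urn:lsid:ubio.org:namebank:3552058"
)
print(extract_lsid(description))
```

The pattern also stops at a closing square bracket, so it would cope with an identifier wrapped in brackets. It would still be fooled by an LSID followed directly by a full stop, which is the usual fragility of pulling identifiers out of free text.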

Anonymous said...

To further simplify programmatic stripping (i.e. scraping) since this isn't the most ideal method...would be nice if there was a dedicated Dublin Core element for this sort of stuff...one could also wrap the LSID in agreed upon brackets in the RSS 2.0 description element for zippier character matching:

e.g. [urn:lsid:ubio.org:namebank:3552058]

Roderic Page said...

The obvious candidate Dublin Core element is <dc:identifier>. However, this is still a bit blunt. I guess what I'd really like is if the RSS feed gave more information, such as a summary of links between names (x with LSID is a synonym of y with another LSID) and other GUIDs (name x with LSID occurs in paper z with DOI). But, as a stopgap, adding LSIDs to fields that would be indexed would be a start.
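A sketch in Python of what an RSS 2.0 item carrying the LSID in `<dc:identifier>` might look like, and how a consumer would read it back. This is an illustration only, not something Zoom is known to produce; the item title is made up for the example, and `<link>` is omitted because the URL of that species page is not given in the discussion.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)

item = ET.Element("item")
ET.SubElement(item, "title").text = "Wamba crispulus"  # illustrative title
ET.SubElement(item, "description").text = (
    "THERIDIIDAE: Wamba crispulus taxonomic and natural history "
    "description in the Nearctic Spider Database."
)
# The identifier gets its own element, so nothing has to be scraped.
ET.SubElement(item, f"{{{DC}}}identifier").text = "urn:lsid:ubio.org:namebank:3552058"

xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)

# A consumer reads the LSID back with a namespace-aware lookup.
parsed = ET.fromstring(xml_text)
lsid = parsed.findtext(f"{{{DC}}}identifier")
print(lsid)
```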

Anonymous said...

True, all those extra bits would be nice, but we have to remember that the feed is produced as a result of a search. So, in the case of The Nearctic Spider Database, if one were to search a known synonym, the top result in the RSS feed will be a page for a species with the currently recognized nomenclature (on that page will be a table with all the recognized nomenclature for the species, which of course is indexed by Zoom). I suppose we have to step back and really think what we'd want from these sorts of feeds, top search results from an OpenSearch provider or a name server-like output, which would likely be too much to ask of museums and other institutions hosting species pages. I would hope that reference and name mapping could be done by providers of those services & could be coordinated/joined via the LSIDs. In my case, I could also add DOI (where available) in the description element for the RSS feed using some sort of similar "[DOI...]" type convention. But, dumping too much stuff in what is really just free-form text will make stripping/scraping that much more difficult.

Roderic Page said...

"But, dumping too much stuff in what is really just free-form text will make stripping/scraping that much more difficult."

And that is precisely why I want RDF! All this scraping-stuff is unnecessary. There's nothing stopping you exporting RDF with lots of useful metadata as an Open Search result (even if Zoom won't support this -- one could always write the search function from scratch).

Anonymous said...

Indeed, one could write the search function from scratch & RDF might be the way to go. What we need is an attractive package for species page providers (and others) that will permit full-text search & OpenSearch or RDF output. As far as I have been able to find, Zoom Search is the only package that does a whole lot and is relatively inexpensive. As I have said all along, it would be worth investigating if the folks at Wrensoft (who developed Zoom Search) would be able to produce a customized package capable of OpenSearch, RDF, and full-text search results. If there is such a package now, please point me in that direction. I don't want to write a function myself because I have ~4,500 such species pages in The Nearctic Spider Database & my time is limited. So, RSS 2.0 is the best, currently available option in Zoom.

Anonymous said...

Sorry to get this discussion going again when clearly it has died down, but in the face of a rather pervasive move to RSS 2.0 in the blogosphere & elsewhere like the OpenSearch specification, I'm wondering how useful something like RSS 2.0 extensions like Yahoo's Media RSS (http://search.yahoo.com/mrss) & also the rather new "Open Media Profile" undertaken by Six Apart (a marriage between OpenSearch & Media RSS extension) would be when paired with Simple Semantic Resolution (http://dannyayers.com:88/xmlns/ssr/index.htm). With a stylesheet, one could deconstruct an RSS 2.0 feed into RDF and those RSS 2.0 extension/modules would nicely fit into RDF.

Anonymous said...

Discovered a really nice Firefox extension (OpenSearchFox) that turns any data source into an OpenSearch data source and it works really well - this could be used to add OpenSearch capabilities to data providers that won't do it themselves.
https://addons.mozilla.org/firefox/3698/

Anonymous said...

Unless I'm missing something, that Firefox plug-in is nothing more than an automated tool for form submission. What we need is a means to aggregate such results from multiple providers, which means we need XML outputs, not HTML outputs.
