Saturday, December 02, 2006

Open Search and The Nearctic Spider Database - almost there

As announced on TAXACOM, David Shorthouse has added an Open Search interface to his really nice Nearctic Spider Database. As I've noted previously (see Adding sources to iSpecies and OpenSearch and iSpecies), OpenSearch seems an obvious candidate for a simple way to add search functionality to biodiversity web sites.

The interface is generated by some software called Zoom Search, and the interface is here. As an example, here is a query for the spider Enoplognatha latimana.


Having an easy way to search a site using a URL API such as Open Search is great, but the feed is RSS 2.0, and as a result has very little information. For example, here's an extract:

 <title>The Nearctic Spider Database: Enoplognatha latimana Hippa & Oksala, 1982 Description</title>
 <description>THERIDIIDAE: Enoplognatha latimana taxonomic and natural history description in the Nearctic Spider Database.</description>
 <zoom:context> ... Descriptions Home Search: Register Log in Enoplognatha latimana Hippa& Oksala, 1982 Temporary ... 2007 Arachnid Calendar FAMILY: THERIDIIDAE Sundevall, 1833 Genus: Enoplognatha Pavesi, 1880 ...</zoom:context>

This information is intended to be displayed in a feed reader, and hence viewed by a human. But, what if I want to put this information in a database, or combine it with other data sources in a mashup, such as iSpecies? Well, I have to scrape information out of free formatted text. In other words, I'm no further forward than if I scraped the original web page.
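To make the problem concrete, here is a minimal Python sketch, assuming an item shaped like the extract above (the ampersand is escaped so that it parses, and the `zoom:context` element is left out because its namespace URI isn't shown here). The names `RSS_ITEM` and `parse_item` are mine, for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 item modelled on the extract above ("&" escaped as "&amp;").
RSS_ITEM = """<item>
  <title>The Nearctic Spider Database: Enoplognatha latimana Hippa &amp; Oksala, 1982 Description</title>
  <description>THERIDIIDAE: Enoplognatha latimana taxonomic and natural history description in the Nearctic Spider Database.</description>
</item>"""


def parse_item(xml_text):
    """Return the item's child elements as a dict of plain strings."""
    item = ET.fromstring(xml_text)
    return {child.tag: (child.text or "").strip() for child in item}


fields = parse_item(RSS_ITEM)

# Parsing the XML is easy, but every value is still an opaque string:
# nothing marks which part is the family, the species name, or the authority.
# Splitting on the first colon only works as long as the site keeps this wording.
family_guess = fields["description"].split(":", 1)[0]
print(family_guess)
```

Getting at the elements is trivial; the fragile part is the last step, where the family name is recovered by guessing at the layout of a sentence.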

If we want to make the information accessible to a computer, then we need something else. RDF is the obvious way forward.

The difference that RDF makes

To illustrate the difference, let's search for images of the same spider (Enoplognatha latimana) using my Open Search wrapper for Yahoo's images search (described in OpenSearch and iSpecies). Here is the query. This feed is formatted as RSS 1.0, and I can view it in a feed reader, such as NetNewsWire.

But, because the feed is RSS 1.0 and therefore RDF, the feed contains lots of information on the image in a form that can be easily consumed.

<foaf:Image rdf:about="
 <dc:subject>Enoplognatha latimana</dc:subject>
 <foaf:thumbnail rdf:resource=

In this example, I use the FOAF and Dublin Core vocabularies. These are widely used, making it easy to integrate this information into a larger database, such as a triple store. To my mind, this is the way forward. We need to move beyond making data accessible only to people, and start making it accessible to computers. Once we do this, then we can start to aggregate and query the huge amounts of data on the web (as exemplified by David's wonderful site on spiders). And once we do that, we may discover all sorts of things that we don't know (see Disconnected databases, and Discovering new things).
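As a rough illustration of what "easily consumed" means, the sketch below reads a record of this shape into (subject, predicate, object) triples using only Python's standard library. It is not a general RDF parser, just enough for this one pattern. The `example.org` URLs are placeholders of mine, not the URLs in the real feed, and `image_triples` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
FOAF = "http://xmlns.com/foaf/0.1/"
DC = "http://purl.org/dc/elements/1.1/"

# Hypothetical record: the example.org URLs are placeholders.
RDF_XML = f"""<rdf:RDF xmlns:rdf="{RDF}" xmlns:foaf="{FOAF}" xmlns:dc="{DC}">
  <foaf:Image rdf:about="http://example.org/images/spider.jpg">
    <dc:subject>Enoplognatha latimana</dc:subject>
    <foaf:thumbnail rdf:resource="http://example.org/thumbs/spider.jpg"/>
  </foaf:Image>
</rdf:RDF>"""


def image_triples(xml_text):
    """Turn each foaf:Image node into (subject, predicate, object) triples."""
    triples = set()
    root = ET.fromstring(xml_text)
    for image in root.findall(f"{{{FOAF}}}Image"):
        subject = image.get(f"{{{RDF}}}about")
        triples.add((subject, RDF + "type", FOAF + "Image"))
        for prop in image:
            # ElementTree writes tags as "{namespace}local"; join them into one URI.
            predicate = prop.tag.replace("{", "").replace("}", "")
            # rdf:resource marks a link to another resource; otherwise use the text.
            obj = prop.get(f"{{{RDF}}}resource") or (prop.text or "").strip()
            triples.add((subject, predicate, obj))
    return triples


triples = image_triples(RDF_XML)
for triple in sorted(triples):
    print(triple)
```

Each value arrives already labelled by a property from a shared vocabulary, so no guessing at sentence layout is needed before loading it into a store.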


Anonymous said...


Thanks for putting thought into this. I wrote a few posts to the Zoom Search forum to solicit comments by the developers: I have no doubt that your very excellent and important suggestions can be implemented. The folks at Zoom Search are quite accommodating and are on top of their game & likely see a huge market here.


Anonymous said...

A few points regarding your comments about the Zoom Search Engine XML output.

1) The Zoom output format was selected to be compatible with the main existing Opensearch aggregator, A9.
From my reading it seems that A9 supports both RSS 2.0 and Atom 1.0. (Not RSS 1.0 / RDF as you suggest as a preferred format)

2) In the example you gave, there really doesn't appear to be a great deal of additional information in your RSS 1.0 example. There is a title, subject, link. The additional thumbnail and source information only applies to image searches. But you didn't search for an image in Zoom. Plus Zoom returned additional information: terms matched, score and page context. So the claim of very little information being returned is incorrect. For the purposes of providing search results, Zoom produces much more useful information.

3) It isn't only a protocol format issue. Zoom can only return the information stored in its search index. For example, Zoom doesn't store a 'subject', but does store a 'description'.

4) I don't see any reason why RSS 2.0 would be harder to 'consume' than RSS 1.0. In fact I would suggest the simpler RSS 2.0 format is easier to deal with.

5) The XML output was not designed to be directly read by humans. Zoom has HTML output for that. It was designed to be easy to parse (not scrape) by scripting languages and aggregators.

Roderic Page said...

Let me respond by saying that it's not my intention to criticise Zoom. It's a cool product. My point was that -- if the ultimate goal is data integration -- it's not enough to provide an easy means to search, or even to return results in a standard format. By way of background, my goal in this area is to advocate using tools from the Semantic Web community. For background see my Semant blog.

Now, point by point:

1) Yes, A9 doesn't support RSS 1.0, which in my opinion is an unfortunate choice on their part.

2) I disagree. The extra information you refer to relates to the search (e.g., score) which, while useful, isn't what I'm after. What my small example gave was a description of an image using an established vocabulary (FOAF) so that others with images described in the same way can aggregate that information. If I wanted, I could extend it further by extracting metadata from the image itself. Why is this useful? Well, again it avoids having a human do this (see Copyright on images for more on this).

4) Depends what the goal is. My goal is to aggregate information from diverse sources and query it. To do this I need several things, such as consistent identifiers (Globally Unique Identifiers or GUIDs), consistent vocabularies (for example, for basic metadata something like Dublin Core, for people FOAF, for publications PRISM, etc.), and tools for storing and querying this information (such as triple stores and languages such as SPARQL). In other words, the Semantic Web.

One way to think about this is to ask "who is going to make use of your feed?" If the answer is "people can view it in a feed reader, or add it as a source to A9 and look at the results" then RSS 2.0/Atom is fine. But if you want computers to consume the feed and be able to merge it with other feeds and make inferences, then I suggest we need RSS 1.0 and RDF (at the very least, this makes things a lot easier).

5) Yes, XML is not designed to be read directly by humans, but that doesn't necessarily make it easier for computers to handle it. For example, if two different XML sources use different vocabularies to describe an image, how are we supposed to merge those two documents?
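To sketch what I mean by merging: if two sources describe their images with the same property (here Dublin Core's `subject`), combining them is just a set union, and one query spans both. Everything below is hypothetical, including the `example.org` and `example.net` URLs and the helper name `resources_about`. A real system would use a triple store and SPARQL rather than Python sets.

```python
DC_SUBJECT = "http://purl.org/dc/elements/1.1/subject"
FOAF_THUMBNAIL = "http://xmlns.com/foaf/0.1/thumbnail"

# Two hypothetical sources; every URL here is a placeholder.
source_a = {
    ("http://example.org/images/1.jpg", DC_SUBJECT, "Enoplognatha latimana"),
    ("http://example.org/images/1.jpg", FOAF_THUMBNAIL, "http://example.org/thumbs/1.jpg"),
}
source_b = {
    ("http://example.net/photos/77.jpg", DC_SUBJECT, "Enoplognatha latimana"),
    ("http://example.net/photos/42.jpg", DC_SUBJECT, "Wamba crispulus"),
}


def resources_about(triples, name):
    """Return every resource whose dc:subject equals the given name."""
    return sorted(s for s, p, o in triples if p == DC_SUBJECT and o == name)


# Because both sources share a vocabulary, merging needs no mapping step.
merged = source_a | source_b
hits = resources_about(merged, "Enoplognatha latimana")
print(hits)
```

If the second source had instead called the property `species` or `keyword`, somebody would have to write and maintain a mapping before the same query could work.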

I hope this clarifies why I made the comments I did. Zoom provides a nice tool for searching a web site and providing results in a standard, accessible form. It's just that I want more than that, and by adding just a little more (i.e., RDF), the potential payoff becomes so much greater. Now, the Semantic Web may well be outside the area that you guys want to get into, and given the chasm between the hype and the current reality, that may be sensible. But, I think there's great potential there. Imagine a product that makes it easy for users to aggregate search results from different sites. Not just displaying them like A9, but integrating them. Some people seem to think there's money in this (see Tales of a Semantic Web Consultancy).

Anonymous said...

In my opinion the Semantic Web is just an academic pipe dream. And not a practical proposition at this time.

The majority of information on the web is unstructured. HTML pages, PDF files & Word documents without any useful meta data most of the time.

At the present time software is just not sophisticated enough to take arbitrary text and interpret the meaning. Current software would be lucky to even correctly extract the author's name from a Word document, let alone get any deeper meaning and structure.

So in the absence of software that can extract meaning from text, creating highly structured output is only possible with structured input, which we don't have.

I still remember being told 20 years ago that (EDI / X12) would replace all other methods of exchanging data. It has since died a slow death except in very narrow market segments. I have no reason to think that RDF will do any better.

We (the Zoom SE developers) build what people ask for. Lots of people ask for XML. Quite a few ask for OpenSearch compatibility. Until now no one has been asking for RSS 1.0.

"Imagine a product that makes it easy for users to aggregate search results from different sites."

Yes, we did. We built and released the product a couple of weeks ago. You can download a demo of Zoom MasterNode here.

Roderic Page said...

Thanks for the comments. I realise that you have to make a living from Zoom, which means you probably have a much keener sense than I of what people actually want right now.

Perhaps the Semantic Web writ large is a pipe dream, but I struggle to see a better way of approaching the task of integrating diverse sources of structured and semi-structured information. Your target audience is much larger than mine, and most information on the web is unstructured. What I've been arguing in various blogs and lists is that in biodiversity informatics we have lots of structured or semi-structured data, and hence the Semantic Web approach is feasible.

I also think indexing that makes use of metadata will become more feasible as time goes on. Metadata tags are now automatically embedded in images taken with digital cameras, blog news feeds have metadata, including geotagging, and documents in XML format with embedded metadata (such as OpenDocument) will become more common. I think it's a case of setting realistic goals, which I agree the Semantic Web community isn't always good at doing. But I suspect the more limited goal of metadata driven searching is a lot closer than some might think.

Charles said...

An alternative to RDF would be to use a microformat; specifically the species microformat. Microformats are currently realising what RDF would hope to achieve. People are making real use out of microformats now, and they increasingly have real support from the real, in-the-trenches web developers. If the semantic web is to be realised, I believe it is microformats that will be the technology to win out simply because they are easy to implement and understand and work with.

What do you think Rob?

Anonymous said...

I would prefer to see something like LSIDs instead of microformats. The latter look like a quick & useful system for web developers (but see unanswered potential problem here) and I would worry that in time, their usefulness would degrade to full text, i.e. screen-scraping once again. What Rod is proposing really wouldn't be difficult to do, it's just a matter of demonstrating a market...or making one.

Anonymous said...

"But I suspect the more limited goal of metadata driven searching is a lot closer than some might think."

Indeed it is. Zoom already scrapes image metadata with an image plug-in to construct its index.

Charles said...

David, LSIDs and Microformats are completely different things; an LSID is a type of globally unique identifier, while a microformat is a standard way of semantically marking up text. A microformat could make use of LSIDs - They aren't mutually exclusive. I'm also not sure how you see their usefulness degrading over time.

I suspect there may be a misunderstanding as to what microformats are here. I would draw your attention to the definition, the introduction to microformats and a post I found comparing RDF, OWL and microformats.

The advantage of microformats (simple, compact, easy to understand and work with) should not be underestimated on the web. Simple, lean technologies can take us a long way - far further than anyone imagined - as we have found with HTML. The key is in getting people to actually use the technology and that's why simplicity and accessibility are so important. You see, microformats don't necessarily compete with RDF; they complement it, lowering the barrier to creating the semantic web. Have a look at the microformats FAQ for RDF for more on this.

Charles said...

Here's another article contrasting RDF and microformats

Anonymous said...

OK, maybe we're on to something here. I can very easily put LSIDs as a "species" microformat in the <h1> tag for the page and also in the META description. Consequently, Zoom will produce its RSS 2.0/xml and in the description element will be something like:

<description>THERIDIIDAE: Wamba crispulus taxonomic and natural history description in the Nearctic Spider Database.</description>

So, an OpenSearch aggregator could strip all text out of this description element with the exception of everything to the right of and including "urn:lsid:"
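A minimal sketch of that stripping step in Python, assuming the LSID appears somewhere in the description text and follows the usual `urn:lsid:authority:namespace:object[:revision]` layout. The LSID in the sample string is invented for illustration, and `extract_lsids` is a hypothetical helper name.

```python
import re

# One LSID component: anything except whitespace, colons, brackets or angle brackets.
PART = r"[^\s:\[\]<>]+"

# urn:lsid:authority:namespace:object with an optional :revision suffix.
LSID_PATTERN = re.compile(rf"urn:lsid:{PART}:{PART}:{PART}(?::{PART})?", re.IGNORECASE)


def extract_lsids(text):
    """Return every LSID-shaped substring found in free text."""
    return LSID_PATTERN.findall(text)


# Hypothetical description: the LSID below is made up.
description = (
    "THERIDIIDAE: Wamba crispulus urn:lsid:example.org:names:12345 "
    "taxonomic and natural history description in the Nearctic Spider Database."
)
print(extract_lsids(description))
```

Note the limits of this approach: an LSID followed directly by a full stop or comma would pick up that punctuation as part of the object identifier, which is the kind of fragility that free text invites.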

Anonymous said...

To further simplify programmatic stripping (i.e. scraping), since this isn't the most ideal method, it would be nice if there was a dedicated Dublin Core element for this sort of thing. One could also wrap the LSID in agreed-upon brackets in the RSS 2.0 description element for zippier character matching:

e.g. []

Roderic Page said...

The obvious candidate Dublin Core element is <dc:identifier>. However, this is still a bit blunt. I guess what I'd really like is if the RSS feed gave more information, such as a summary of links between names (x with LSID is a synonym of y with another LSID) and other GUIDs (name x with LSID occurs in paper z with DOI ). But, as a stop gap, adding LSIDs to fields that would be indexed would be a start.
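As a stop-gap sketch, here is how an RSS 2.0 item might carry the LSID in its own `dc:identifier` element instead of burying it in the description. This uses Python's standard library only. The LSID is made up, and `build_item` and `read_identifier` are hypothetical helper names, not part of Zoom or any feed library.

```python
import xml.etree.ElementTree as ET

DC = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC)  # serialise with the conventional "dc" prefix


def build_item(title, lsid):
    """Build an RSS 2.0 <item> with the LSID in a dc:identifier element."""
    item = ET.Element("item")
    ET.SubElement(item, "title").text = title
    ET.SubElement(item, f"{{{DC}}}identifier").text = lsid
    return item


def read_identifier(item):
    """Read the identifier back without touching any free text."""
    return item.findtext(f"{{{DC}}}identifier")


# Hypothetical LSID, for illustration only.
item = build_item("Wamba crispulus", "urn:lsid:example.org:names:12345")
xml_text = ET.tostring(item, encoding="unicode")
print(xml_text)
print(read_identifier(ET.fromstring(xml_text)))
```

The consumer looks up one named element, so no pattern matching against the description is needed.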

Anonymous said...

True, all those extra bits would be nice, but we have to remember that the feed is produced as a result of a search. So, in the case of The Nearctic Spider Database, if one were to search a known synonym, the top result in the RSS feed will be a page for a species with the currently recognized nomenclature (on that page will be a table with all the recognized nomenclature for the species, which of course is indexed by Zoom). I suppose we have to step back and really think what we'd want from these sorts of feeds, top search results from an OpenSearch provider or a name server-like output, which would likely be too much to ask of museums and other institutions hosting species pages. I would hope that reference and name mapping could be done by providers of those services & could be coordinated/joined via the LSIDs. In my case, I could also add DOI (where available) in the description element for the RSS feed using some sort of similar "[DOI...]" type convention. But, dumping too much stuff in what is really just free-form text will make stripping/scraping that much more difficult.

Roderic Page said...

"But, dumping too much stuff in what is really just free-form text will make stripping/scraping that much more difficult."

And that is precisely why I want RDF! All this scraping-stuff is unnecessary. There's nothing stopping you exporting RDF with lots of useful metadata as an Open Search result (even if Zoom won't support this -- one could always write the search function from scratch).

Anonymous said...

Indeed, one could write the search function from scratch & RDF might be the way to go. What we need is an attractive package for species page providers (and others) that will permit full-text search & OpenSearch or RDF output. As far as I have been able to find, Zoom Search is the only package that does a whole lot and is relatively inexpensive. As I have said all along, it would be worth investigating if the folks at Wrensoft (who developed Zoom Search) would be able to produce a customized package capable of OpenSearch, RDF, and full-text search results. If there is such a package now, please point me in that direction. I don't want to write a function myself because I have ~4,500 such species pages in The Nearctic Spider Database & my time is limited. So, RSS 2.0 is the best, currently available option in Zoom.

Anonymous said...

Sorry to get this discussion going again when clearly it has died down, but in the face of a rather pervasive move to RSS 2.0 in the blogosphere and elsewhere (the OpenSearch specification, for example), I'm wondering how useful RSS 2.0 extensions like Yahoo's Media RSS, and also the rather new "Open Media Profile" undertaken by Six Apart (a marriage between OpenSearch and the Media RSS extension), would be when paired with Simple Semantic Resolution. With a stylesheet, one could deconstruct an RSS 2.0 feed into RDF, and those RSS 2.0 extensions/modules would fit nicely into RDF.

Anonymous said...

Discovered a really nice Firefox extension (OpenSearchFox) that turns any data source into an OpenSearch data source, and it works really well - this could be used to add OpenSearch capabilities to data providers that won't do it themselves.

Anonymous said...

Unless I'm missing something, that Firefox plug-in is nothing more than an automated tool for form submission. What we need is a means to aggregate such results from multiple providers, which means we need XML outputs, not HTML outputs.
