Back to the Future - Resource Discovery, Revisited

This post is prompted by recent Twitter discussions involving some of my co-conspirators from the 1990s, back when I was a digital libraries researcher, web caching poohbah and some-time shambrarian.

We had a particular view that Internet search would (or perhaps "should") evolve along the same lines as the Internet itself - a many-faceted, distributed and decentralized networked organism, held together by a common web of protocols and interchange formats. This had worked pretty well for TCP/IP, after all, so why not take the same approach for finding stuff?

Bear in mind that WAIS and Gopher were still alive and kicking at this point...

Acronym Soup

Jill Foster's excellent Networked Information Retrieval RFC gives some good background reading on the prevailing ethos of the 90s - a trip down memory lane for some. For the uninitiated - an introduction to a bewildering world of Archies, Gophers, and Veronicas. You can also get a feel for the thinking of the time from the Models Information Architecture diagram below. This was part of a study for the Renardus project which I contributed to back in 2000. MIA begat Andy Powell's infamous DNER diagram and subsequent work on the JISC Information Architecture.


Models Information Architecture - from the Renardus resource discovery study (2000)

You might note the date here - we wrote this report in the same year that the infant Google announced it had indexed a billion web pages. I'll come back to the elephant in the room in a moment, but first...

The approach I've described above escaped out of the labs, and came to power a number of national services. In the UK the Follett Report had recommended pump priming for a major initiative to support use of IT in libraries, and in the mid 1990s a number of projects and services were funded under the resulting eLib programme, led by the redoubtable Chris Rusbridge.

The eLib Years

With some encouragement by Lorcan Dempsey (then of UKOLN) I became fairly deeply involved in the Subject Gateways strand of eLib. This concept was an outgrowth of the Social Sciences Information Gateway (SOSIG), developed around 1994 by Nicky Ferguson, Debra Hiom et al at the ILRT in Bristol, with technical assistance from a mysterious character we shall refer to only as "Jim'll" :-)

The goal of the service was to use expert help to assist people in navigating the minefield of the nascent Internet. This was done through training and awareness raising, and also by using human beings to catalogue "quality" Internet resources that University staff and students might want to investigate. SOSIG users could browse through subject categories or search the catalogue entries. A number of these subject gateways were set up to specialise in particular areas, e.g. OMNI for medical information and EEVL for engineering. These services were brought together in 2006 as Intute (see below).

The Intute Service (2011) - captured for posterity

Behind the scenes the Subject Gateways used a small army of experts (mostly librarians) to catalogue websites, in the same way that they might catalogue books in a library. Jim'll and I developed a piece of web-based software called "ROADS" to automate much of this work, as described in this paper for D-Lib magazine co-written with John Kirriemuir and Dan Brickley (then of the ILRT), and Sue Welsh (then of OMNI). ROADS was a punt on the likely standards for Internet search and information interchange, based on the IETF's WHOIS++ protocol work. ROADS explicitly permitted metadata from different services to be aggregated, and provided facilities for multiple services to be searched efficiently in a single query using a summary of the indexed data known as a centroid.
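For anyone who never met a centroid in the wild, here is a rough sketch of the idea in Python. It is a simplification for illustration and not the ROADS or WHOIS++ implementation: the function names, the record layout and the sample entries are all hypothetical. Each service publishes the set of unique words it holds for each attribute, and a front end consults those summaries to decide which services are worth querying at all.

```python
from collections import defaultdict


def build_centroid(records):
    """Summarise a catalogue as the set of unique words seen per attribute.

    `records` is a list of dicts mapping attribute name -> text value.
    (Hypothetical layout, chosen only for this sketch.)
    """
    centroid = defaultdict(set)
    for record in records:
        for attribute, value in record.items():
            centroid[attribute].update(value.lower().split())
    return dict(centroid)


def route_query(centroids, attribute, term):
    """Return the names of services whose centroid lists `term` for `attribute`.

    Only these services need to receive the real query; the rest are skipped.
    """
    term = term.lower()
    return sorted(
        name
        for name, centroid in centroids.items()
        if term in centroid.get(attribute, set())
    )


# Made-up catalogue entries for two gateways (illustrative only).
sosig_records = [
    {"title": "Social Science Data Archive", "keywords": "sociology statistics"},
]
omni_records = [
    {"title": "Medical Imaging Atlas", "keywords": "radiology anatomy"},
]

centroids = {
    "sosig": build_centroid(sosig_records),
    "omni": build_centroid(omni_records),
}

# Which gateways could possibly answer a keyword search for "radiology"?
candidates = route_query(centroids, "keywords", "radiology")
```

The point of the design is that the summary is much smaller than the full index, so it is cheap to ship around, and a term that is missing from a service's centroid means that service can safely be left out of the search.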

From Cataloguing to Indexing

We followed up on this work by integrating the Z39.50 targets beloved of the digital libraries community. The figure below shows the excellent work by Peter Valkenburg (then of TERENA) on a cross-search user interface in our TERENA funded CHIC-Pilot project.

CHIC-Pilot sought to bring together the human assisted catalogue data from the subject gateways with robot based indexes created using the Harvest software and a Z39.50 based equivalent developed by Sebastian Hammer at Index Data and Sigfrid Lundberg, Mattias Borell et al at Lund University. The goal was for people to be able to search a European Web Index of academic sites and also the human assisted subject gateway catalogues through a friendly common front end.

CHIC-Pilot search interface

And yes, CHIC-Pilot had its own version of the MIA diagram...

CHIC-Pilot architecture diagram

In the UK a number of us had been running regional web indexers using the DARPA funded Harvest software. These "gatherers" crawled their local websites and then exported their indexes in Summary Object Interchange Format (SOIF) to a central indexer and search engine ("AC/DC") run by Dave Beckett (then of the University of Kent). The rationale behind AC/DC is discussed in an article for Ariadne magazine by Dave and Neil Smith (then of HENSA), and shown in the figure below.



The Bubble Bursts

As it happened, our punt didn't work out - WHOIS++ didn't take off, and LDAP became the lingua franca of directory services through its adoption by Microsoft (Active Directory) and Novell (eDirectory). More importantly, these organizational LDAP services were typically firewalled to the Nth degree from the Internet, and the LDAP protocol was only very rarely used for anything other than directory services.

Although the Internet search aspects of the Harvest project died a death after a couple of years of curation as an open source community project, the web caching component of the software drew widespread acclaim and begat both the open source Squid cache engine, and the NetCache appliance from NetApp. I became a heavy Squid user and contributor, with a particular interest in Squid's inter-cache communications protocol, the Internet Cache Protocol (ICP), and then later its cache digest protocol. These two technologies revived the promise of a distributed, decentralised web of services, albeit in a slightly different context. ICP let one cache query another for a URL, and cache digests let a cache share a summary of its contents with another cache, using a lossy summary known as a Bloom filter. We subsequently went on to build meshes of caches spanning whole countries, with Europe being a particular hotbed thanks to the efforts of John Martin (then at TERENA).
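To make the Bloom filter idea a little more concrete, here is a minimal sketch in Python. It is not Squid's actual digest code: the class name, the bit array size and the use of SHA-256 to derive four bit positions per URL are assumptions made for this example. A cache adds each URL it holds to the digest, ships the bit array to a neighbour, and the neighbour can then test a URL locally instead of sending a query over the network.

```python
import hashlib


class CacheDigest:
    """Toy Bloom filter summarising the URLs held by a cache (illustrative only)."""

    NUM_HASHES = 4  # assumption: four bit positions per URL

    def __init__(self, size_bits=1024):
        self.size_bits = size_bits
        self.bits = bytearray((size_bits + 7) // 8)

    def _positions(self, url):
        # Derive four bit positions from four 4-byte slices of one hash value.
        digest = hashlib.sha256(url.encode("utf-8")).digest()
        for i in range(self.NUM_HASHES):
            chunk = digest[4 * i:4 * i + 4]
            yield int.from_bytes(chunk, "big") % self.size_bits

    def add(self, url):
        """Record that the cache holds `url` by setting its bits."""
        for pos in self._positions(url):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, url):
        """False means definitely not cached; True means probably cached."""
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(url)
        )


# Hypothetical usage: a neighbour consults our digest before asking us.
digest = CacheDigest()
digest.add("http://www.example.org/index.html")
probably_cached = digest.might_contain("http://www.example.org/index.html")
```

The "lossy" part is that different URLs can set the same bits, so a digest can occasionally claim a hit for something the cache does not hold (a false positive). It never denies holding something that was added, which is the property that makes it safe to use for deciding where to send a request.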

Ironically given the vision of the Internet I had become wedded to, my interest in the technology of web caching was to lead to a mini-career as a caching expert running a huge centralised service. This all stemmed from a report on caching on JANET written in 1996 by Andrew Cormack (then at Cardiff). Andrew's report led to the establishment of the JANET Web Cache Service, which we and Manchester Computing successfully tendered for. There followed a period of some five years during which the exigencies of running a production service relied upon by most of the Higher Education sector took precedence over my research interests (Translation: there goes the PhD :-)

Web Caches R Us

This was all happening in tandem with JANET's introduction of charging for international (or more properly, US related) traffic, as a very British way of influencing institutions' use of bandwidth. Consequently, JANET-connected institutions had a very strong incentive to send as much as possible of their traffic through the JWCS, and indeed this was the behaviour we saw when we analysed the statistics from the JANET international connections, as shown in the figure below.


Around the start of the new millennium the rise of peer-to-peer filesharing protocols meant that caching was no longer an effective way of policing international charging. And with the bursting of the dot-com bubble, the cost of trans-Atlantic bandwidth had fallen through the floor. The figure below shows the point at which peer-to-peer really started to make its presence felt for us at Loughborough, towards the end of the year 2000. The purple spike is P2P traffic.

It may be interesting to reflect that the JANET caches performed very well in their heyday, due no doubt to the large proportion of static content on the web in the late 90s - remember that AJAX had yet to be invented. An example is shown in the figure below, where the yellow area indicates local institutional cache hits, and the green indicates national (JWCS) cache hits - the red is the remaining web traffic which was either uncacheable material or material not already cached. Of course this principle lives on in quite a different form (user and ISP intervention not required) in the Content Distribution Networks, notably Akamai. I was pleased that one of my last acts as a JWCS person was to put the right people in touch with each other to get an Akamai node installed at ULCC to service JANET users.

The JWCS era was really a formative period for me, and I'm indebted to the folk who were involved in this work. Thanks again to George Neisser, John Heaton (now sadly deceased), Michael Sparks, Richard Hanby, Andrew Veitch, Graeme Fowler and Matthew Cook for a very stimulating few years.

Now that organizational proxy cache servers are no longer in widespread use, metrics such as those shown above are harder to come by. My intuition is that with the changes in the funding model for higher education in the UK, usage based charging is likely to rear its head again at some point - perhaps through a tiered model of JANET subscription modelled on the 3G data policies in mobile phone contracts.

Back to the Future

So where does this leave my dream of Internet resource discovery?

Whilst the work I've described above is now pretty obscure, I'd argue that it has done a great job of pointing us in the right direction. The ideas and ethos of this era subsequently found their way into many important and influential places, e.g. through Dan's work at the W3C and Joost, Dave's move to Yahoo!, Lorcan's move to OCLC, and Michael's work for the BBC. [I stayed at Loughborough, but this is quite an unusual institution that often has the feel of working for a high tech start-up about it :-]

I had envisaged a distributed and decentralized approach to search, where individuals and organizations would choose to make selected metadata available to multiple aggregators and the punters would use a variety of search and retrieval protocols to target particular collections, subject classifications etc. In many ways the work I've described above prefigured FOAF and social network graphs, RDF, SPARQL and Linked Data more generally (Peter and Dan have a key paper here) - and let's not forget OAI-PMH for Repository folk.

Perhaps more importantly, I had assumed that the inner workings of these services would be largely exposed to the punter, who would be able to mix and match services to their taste. How wrong I was! We can only catch a glimpse of the innards of key Internet services through a lens (Hadoop) darkly, and attempts to carry out relevant research are somewhat stymied by the sheer scale at which the key players now operate. Try to replicate this in a University Computer Science department, and see how far you get! RSS is probably the closest thing to my original vision, and even then (as Pat Lockley has noted) there are a lot of problems around schema consistency and proprietary extensions.

I can take some consolation that in my semi-abandoned PhD research I had a novel idea about using IP multicast in search and retrieval which I've yet to see appear independently anywhere else. Although multicast for IPv4 never achieved widespread deployment, this may be something worth revisiting as we finally migrate to IPv6...