Peak Data - Fracking Knowledge from the Digital Exhaust



If data is the new oil, then how do we go about extracting and refining it? And will it come out easily, or will we need to resort to "data fracking"? I'll talk here about my favourite idea that came out of the recent Jisc think tank meeting on Big Data and Analytics, which is all about setting up mechanisms for data and potentially code sharing across academia, government and industry. I'm also setting you up to find out more about the literal Digital Exhaust.

[Picture credits Rob Englebright, reproduced with permission]

Peak Data

First off, the data as oil analogy is somewhat flawed as open data is presently still in relatively short supply. Furthermore, organizations aren't always attuned to the possibilities that accrue from strategic use of open data. Hence extraction can be problematic - cue those fracking comparisons. But in many ways we are building up to Peak Data, whereas the petrochemical industry is (we are told) already starting to contract after Peak Oil (shale gas notwithstanding).

We may envisage a world where data is more widely available. This might arise as a byproduct of government funded projects and agencies routinely sharing their data holdings - encouraged in recent policy statements from RCUK and other funders in the UK. I have written about these changes in Suddenly, Everything has Changed and Describing the Elephant. But this is a very education sector centric view - industry also releases open data and enters into (more restrictive but still...) collaborative agreements with peers, supply chain partners and even competitors.

If that last point sounds unrealistic, consider the eTRIKS project. This brings together a number of University partners with most of the big pharma firms including Roche, Janssen, GlaxoSmithKline, Bayer, Pfizer and Merck. The goal of this work is to build a common platform for pre-competitive translational research around the open source tranSMART software. For a whistlestop tour of eTRIKS, please see this presentation from AstraZeneca's Ian Dix:



Another notable example is the CFMS HPC facility in Bristol, started up by a consortium of major firms including Airbus, BAE Systems, Rolls-Royce and Williams F1. Could everyone come together to agree on a common supercomputing platform? Eight years on and CFMS is going strong - and has been in many ways an inspiration for the work that we are now doing with the HPC Midlands supercomputing shared service. Here are a few slides on CFMS from their former General Manager, Michael Davies:



Whilst both eTRIKS and CFMS have benefitted from some public funding, it is easy to see that the firms involved already have the resources to collaborate. Are they simply taking advantage of the opportunity to cream some money off the taxpayer, as a cynic might say, or are there genuine opportunities for the education sector to act as a catalyst? I believe the answer is very much the latter, if we are prepared to rethink our existing mental models and embrace these wider collaborations - codified here at Loughborough as Research that Matters.


The Digital Exhaust

As a former networking researcher (described in Back to the Future - Resource Discovery, Revisited) I was often stymied by the difficulty of getting access to raw data to test my hypotheses. Ironically, I ended up becoming an infrastructure geek largely as a byproduct of this desire to "get the data", but then found that infrastructure work simply expands to fill all the available time, and then some more again.

Picking a data point at random from one of the firms I have already mentioned, how likely is it that Rolls-Royce are able to effectively analyse all of the ACARS digital telemetry from their recent engines? We have heard a lot about the Digital Exhaust, but this is the real thing. Are there opportunities here to improve engine performance, provide feedback to pilots on flying styles, and to ground service engineers on optimal settings? You bet.

There has been a somewhat febrile atmosphere recently around whether we view (say) jet engine telemetry, GP patient records, the 100K Genomes from Genomics England or even institutional Virtual Learning Environment click trails as "Big Data". In truth, all of these things are part of our wider digital exhaust, a mixture of public and private data - open, closed, and "semi-open" through the likes of Twitter's recent API changes which now require authentication and all API applications to be centrally registered.

Should I reasonably expect to be able to mash up my genome, records of GP and hospital consultant visits and results of blood based biomarker tests against interventions and outcomes with the larger UK population, environmental test readings and other epidemiological information from my community? Absolutely. How do we go about that? This is where I think Jisc is uniquely placed...


Jisc as Innovation Catalyst

What can Jisc add to the mix? Being a national, cross-sector organization gives it some particular angles. Jisc is particularly well placed to provide a stimulus to encourage the wider sharing of data about learning journeys and outcomes beyond that required by central government for statistical purposes like Key Information Sets, but I think it also has a broader role through using its status and reputation to bring key players to the table. More on this later.

First, though, a key question: To what extent do top down initiatives like KIS produce data that is actually useful and used by prospective students? What if we were going to enormous lengths to collect the wrong data, as Ranjit Sidhu asserted in his talk for IWMW 2013? By pure chance, this week's ODIfridays lecture at the Open Data Institute covers exactly this ground, so do go along if you are in London and free over lunch. [For readers from the future, the date in question is July 19th 2013]

There is a broader context, which follows from the HESA/HEFCE HE Information Landscape work: institutions don't just have to produce the KIS return. In fact many Universities have several hundred statistical returns to complete every year. These include numerous statutory returns, and a wealth of data required for certifications by professional bodies. The diagram below, taken from the Information Landscape study final report, shows how these requirements break down:


Jisc is ideally placed here to get the relevant stakeholders around the table to thrash out a smaller set of statistical returns, if not core HE and FE sets. This would potentially save the sector many millions of pounds per year through the reduced cost to institutions of compliance.

Jisc can also add value around the negotiation of sector wide agreements such as the work done recently on the JANET Cloud Brokerage, model agreements for cloud services, and peerings with industry giants like Google, Microsoft and Amazon - and in the future hopefully the likes of our collaborators E.ON and Rolls-Royce.

Does it also make sense to be thinking about procuring shared services on behalf of the sector? You might think that Universities and Colleges have a whole host of unique requirements that would frustrate this, but the reality is that over 50% of the HE sector have independently chosen the same timetabling system, student records system, and so on. See the UCISA CIS Survey for the full results - I've excerpted a couple of examples below.




Does this have the potential to save individual institutions and the sector overall a significant sum of money through reduced operating costs? You bet. Would suppliers be interested in participating? Quite possibly - each OJEU scale procurement costs them tens of thousands of pounds to respond to. Might this proposal encounter significant institutional level resistance and require intervention from HEFCE, BIS etc to drive forward? Quite likely!

I've also touched in the past on a lighter touch version of this, in my Elevator Pitch for code.ac.uk - the concept here is that Jisc puts relatively small amounts of money into encouraging institutions to clean up, document and share their code for interfacing key corporate systems. You could think of it as a cheap and cheerful stepping stone to the Enterprise Service Bus being developed as part of the JISC Advance Nexus project. I don't see this happening without some sort of contractual relationship in place.


Playing Devil's Advocate
So we can see a role for Jisc around negotiating sector wide deals for the likes of metadata schemas for statistical returns and procuring or developing shared systems and services. These activities might seem to be something of a departure from the activities of the old JISC, which tended to take a "thousand flowers bloom" approach to innovation. However, we could ask ourselves to what extent should Jisc be undertaking "research"? We have research councils for that, after all...

Of the areas I've discussed, analytics in particular is a hard nut to crack, because there are now a number of products on the market around business intelligence, key performance indicators, and especially teaching and learning - including our own Co-Tutor student relationship management system (of which more anon), and offerings from the likes of Blackboard and Ellucian (Course Signals).

I would dearly like to see Jisc using its thought leadership role to act as a catalyst in discussions that bring academia, government and industry together, such as those taking place around the Big Innovation Centre. However, I would also ask a litmus test question for any proposed projects and programmes: Will it happen anyway? (and if so, how would Jisc involvement accelerate the process?)


Martin Hamilton

Martin Hamilton works for Jisc in London as their resident Futurist.