Describing the Elephant - Adventures in Research Data Management





What does your research data look like, and what facilities would your researchers welcome to help them manage it?  Here are the first fruits of a survey of researchers at Loughborough.

[Header photo above - picture credit to Flickr user jimmiehomeschoolmom]

Many UK universities are carrying out similar work right now, to help them understand what they need to do in order to comply with the Research Council mandate to open up access to research data, which I've blogged about previously in "Suddenly, Everything has Changed".

If this is all news to you, my talk from the Digital Curation Centre's event on Research Data Management earlier this year may help to set the context:


Like the elephant that the proverbial blind men tried to describe, research data is all things to all people. We thought it would be helpful to probe a little more around several facets of our research data. Our survey has only been open for a few days, but we have already had 77 responses back. This blog post will be used to help promote the survey and to encourage further feedback.

Loughborough research folk, we'd love to hear from you - please feel free to complete the survey by clicking on this link.  Comments on this post are also very welcome, so please don't hold back!


Q1. Is it an elephant?

We wondered what formats of "data" people were actually working with, and asked them to tick the relevant box(es) to indicate whether their research data was: in Office document format; experimental samples such as genome sequences; audio/video media; computer-based models (e.g. Computational Fluid Dynamics); offline (e.g. handwritten interview transcripts) or "other". Here's what they said...


Quite a bit of offline data there, and I think we will find lots of analogue audio and video recordings too as we start to dig a little deeper. There's either a digitization challenge akin to a mini Google Books project here, or we simply digitize on request - and maintain cataloguing metadata that makes it possible for people to learn of the datasets' existence.


Q2. Is it a herd?

We asked people to tell us how many research data sets they worked with - 10 or fewer, between 10 and 100, over 100, or over 1,000. You can see the results in the chart below. Note that these include both working data and data specifically supporting research publications.


If you only have a handful of key datasets (as roughly half of our respondents do), then managing the metadata to describe them and make them findable ought not to be a huge challenge. But what do we do about the 26% of respondents who have hundreds or even thousands of datasets?  A lot of these were computer-based models, so there is some potential for automatically generated metadata here.  We have been aiming to trial a Research Data Repository akin to our Institutional Repository, but the survey responses raise some interesting questions about how we manage the process of creating and maintaining the metadata, let alone the data deposit.
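
To give a flavour of what automatically generated metadata might look like, here is a minimal Python sketch that walks a folder of files and builds a basic descriptive record for each one. The field names and the functions `describe_file` and `describe_dataset` are my own illustrative assumptions, not the schema of any particular repository, and a real deposit workflow would need much richer, discipline-specific metadata.

```python
import hashlib
import mimetypes
from datetime import datetime, timezone
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Checksum a file in chunks so large model outputs need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def describe_file(path: Path) -> dict:
    """Build a basic metadata record for one file (hypothetical field names)."""
    stat = path.stat()
    mime_type, _ = mimetypes.guess_type(path.name)
    modified = datetime.fromtimestamp(stat.st_mtime, tz=timezone.utc)
    return {
        "filename": path.name,
        "size_bytes": stat.st_size,
        "modified": modified.isoformat(),
        # Fall back to a generic type when the extension is not recognised.
        "mime_type": mime_type or "application/octet-stream",
        "sha256": sha256_of(path),
    }


def describe_dataset(root: Path) -> list:
    """Return one record per regular file found anywhere under root."""
    return [describe_file(p) for p in sorted(root.rglob("*")) if p.is_file()]
```

Calling `describe_dataset(Path("some/results/folder"))` returns a list of dictionaries that could be dumped to JSON or CSV as a starting point for a catalogue entry. It only captures what the file system already knows; the "what is this and why does it matter" part still has to come from the researcher.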


Q3. How big are your elephants?

We asked people to tell us how much total storage their research data would require - less than 100MB, between 100MB and 1GB, between 1GB and 1TB, over 1TB, or over 10TB. Note that these figures include both "working" data and data supporting formal research outputs. You can see the results below.
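
If you wanted to place a measured storage total into these bands rather than rely on self-reporting, a small Python sketch along the following lines would do it. This is illustrative only: I am assuming decimal units (1MB = 1,000,000 bytes), reading the top two options as "1TB to 10TB" and "over 10TB", and putting values that sit exactly on a boundary into the higher band. The name `size_band` is hypothetical.

```python
from bisect import bisect_right

# Decimal units are an assumption; the survey did not specify.
MB, GB, TB = 10**6, 10**9, 10**12

# Upper boundaries between bands, in ascending order.
THRESHOLDS = [100 * MB, 1 * GB, 1 * TB, 10 * TB]
LABELS = [
    "less than 100MB",
    "100MB to 1GB",
    "1GB to 1TB",
    "1TB to 10TB",
    "over 10TB",
]


def size_band(total_bytes: int) -> str:
    """Map a total size in bytes to one of the survey's storage bands."""
    if total_bytes < 0:
        raise ValueError("total_bytes must not be negative")
    # bisect_right counts how many thresholds are <= total_bytes,
    # which is exactly the index of the band the total falls into.
    return LABELS[bisect_right(THRESHOLDS, total_bytes)]
```

For example, `size_band(5 * GB)` falls into the "1GB to 1TB" band.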


It's also notable that most of the respondents with large numbers of datasets actually had large numbers of large datasets, although several people had amassed over 1TB of small files such as Office documents. Examples of work with large datasets included Genomics and Computational Fluid Dynamics, as you might expect.


Q4. Can your elephants jump?

We asked people to tell us what proportion of their data directly supported research outcomes, with the results shown in the graph below:


Clearly a significant proportion of the data that people are working with is not directly relevant to research outcomes and reproducibility of results.  However, you will see in the feedback section below that this is not the end of the story.


Qualitative Feedback

I noted at the top of this post that we are just feeling our way at the moment in this area, and the survey responses have been very helpful in this respect.  We also gave respondents a free text box should they wish to add qualitative feedback, comments or queries.

Here is a brief summary of the feedback we have received to date.  Much of this applies to the wider research data management agenda, rather than the specifics of how we approach RDM at Loughborough, and may be of wider interest:
  • Confidentiality requirements:
    • privacy and anonymity of experimental subjects
    • industrial and public sector partnerships such as our work with the NHS
  • Work would be required in some cases to anonymize e.g. interview transcripts
  • Permission may be required to share data that was not collected with public release in mind
  • Open data needs to be written into all new grant applications, project plans and experimental protocols - not added as an afterthought
  • Should we be aiming to open "all" data, or just the data required for reproducibility of published results?
  • Proprietary software is required in some cases to process data - this may not be generally available, and in some cases is commercial in confidence or even security classified
  • Software could be considered as "data" and in many cases will be required in order to reproduce published results - should researchers be encouraged to open source any software they develop?
  • Is it feasible to open all of the data supporting our REF2014 submissions?
  • Requirements around inter-institutional collaboration, and suitability of cloud services like Google Drive and Dropbox - availability outside institutional silos versus bandwidth/latency
  • Availability of central storage for "big data" - a number of respondents felt that this would make an enormous difference to their research
  • Processing capability for "big data" - the University's existing HPC facilities can be used in extremis for crunching big data, but the software that exists to do this (e.g. Hadoop) is not a good match for the traditional HPC environment
  • Re-use of existing open data - strategies for citing, e.g. whether to make a local copy available to be sure of being able to satisfy the open data mandate
  • Relationship to other initiatives, e.g. ESRC Data Archive, and arXiv's archive facility for data linked to publications
So, all in all a good start - with some very helpful feedback.  If you are a Loughborough reader and you haven't already completed the survey, please do.  For readers from other institutions, I'd welcome any comments (via the form below) on the points raised above and your own attempts to "describe the elephant".


Postscript

From the Wikipedia article on Blind Men and an Elephant, here's a neat twist on the parable:
Six blind elephants were discussing what men were like.
After arguing they decided to find one and determine what
it was like by direct experience.
The first blind elephant felt the man and declared -
“Men are flat.”
After all the blind elephants felt the man, they agreed.