What does your research data look like, and what facilities would your researchers welcome to help them manage it? Here are the first fruits of a survey of researchers at Loughborough.
[Header photo above - picture credit to Flickr user jimmiehomeschoolmom]
Many UK Universities are carrying out similar work right now. This is to help them understand what they need to do in order to comply with the Research Council mandate to open up access to research data, which I've blogged about previously in Suddenly, Everything has Changed.
If this is all news to you, my talk from the Digital Curation Centre's event on Research Data Management earlier this year may help to set the context:
Loughborough research folk, we'd love to hear from you - please feel free to complete the survey by clicking on this link. Comments on this post are also very welcome, so please don't hold back!
Q1. Is it an elephant?
We wondered what sorts of formats of "data" people were actually working with, and asked them to tick the relevant box(es) to indicate whether their research data was: in Office document format; experimental samples such as genome sequences; audio/video media; computer based models (e.g. Computational Fluid Dynamics); offline (e.g. handwritten interview transcripts) or "other". Here's what they said...
Q2. Is it a herd?
We asked people to tell us how many research data sets they worked with - 10 or fewer, between 10 and 100, over 100, or over 1,000. You can see the results in the chart below. Note that these include both working data and data specifically supporting research publications.
Q3. How big are your elephants?
We asked people to tell us how much total storage their research data would require - less than 100MB, between 100MB and 1GB, between 1GB and 1TB, over 1TB, or over 10TB. Note that these figures include both "working" data and data supporting formal research outputs. You can see the results below.
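If you would like to sort your own storage estimates into the same bands, here is a minimal Python sketch. It is not how we processed the survey responses: the function name `storage_band`, the band labels, the use of decimal units (1MB = 10^6 bytes) and the choice to put a value that sits exactly on a boundary into the higher band are all assumptions made for illustration.

```python
# Hypothetical helper: sort a storage estimate into the Q3 survey bands.
# Assumptions (not stated in the survey): decimal units, and a value that
# sits exactly on a boundary counts towards the higher band.

MB = 10**6
GB = 10**9
TB = 10**12

# (exclusive upper bound in bytes, label), in ascending order
BANDS = [
    (100 * MB, "less than 100MB"),
    (1 * GB, "100MB to 1GB"),
    (1 * TB, "1GB to 1TB"),
    (10 * TB, "1TB to 10TB"),
]


def storage_band(size_bytes: int) -> str:
    """Return the survey band label for a size given in bytes."""
    if size_bytes < 0:
        raise ValueError("size must be non-negative")
    for upper, label in BANDS:
        if size_bytes < upper:
            return label
    # Anything at or above the last bound falls into the top band.
    return "over 10TB"
```

For example, `storage_band(5 * TB)` falls in the "1TB to 10TB" band under these assumptions.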
Q4. Can your elephants jump?
We asked people to tell us what proportion of their data directly supported research outcomes, with the results shown in the graph below:
Clearly a significant proportion of the data that people are working with is not directly relevant to research outcomes and reproducibility of results. However, you will see in the feedback section below that this is not the end of the story.
Qualitative Feedback
I noted at the top of this post that we are just feeling our way at the moment in this area, and the survey responses have been very helpful in this respect. We also gave respondents a free text box should they wish to add qualitative feedback, comments or queries.
Here is a brief summary of the feedback we have received to date. Much of this applies to the wider research data management agenda, rather than the specifics of how we approach RDM at Loughborough, and may be of wider interest:
- Confidentiality requirements:
  - privacy and anonymity of experimental subjects
  - industrial and public sector partnerships such as our work with the NHS
- Work would be required in some cases to anonymize material such as interview transcripts
- Permission may be required to share data that was not collected with public release in mind
- Open data needs to be written into all new grant applications, project plans and experimental protocols - not added as an afterthought
- Should we be aiming to open "all" data, or just the data required for reproducibility of published results?
- Proprietary software is required in some cases to process data - this may not be generally available, and in some cases is commercial in confidence or even security classified
- Software could be considered as "data" and in many cases will be required in order to reproduce published results - should researchers be encouraged to open source any software they develop?
- Is it feasible to open all of the data supporting our REF2014 submissions?
- Requirements around inter-institutional collaboration, and suitability of cloud services like Google Drive and Dropbox - availability outside institutional silos versus bandwidth/latency
- Availability of central storage for "big data" - a number of respondents felt that this would make an enormous difference to their research
- Processing capability for "big data" - the University's existing HPC facilities can be used in extremis for crunching big data, but the software that exists to do this (e.g. Hadoop) is not a good match for the traditional HPC environment
- Re-use of existing open data - strategies for citing, e.g. whether to make a local copy available to be sure of being able to satisfy the open data mandate
- Relationship to other initiatives, e.g. ESRC Data Archive, and arXiv's archive facility for data linked to publications
Postscript
From the Wikipedia article on Blind Men and an Elephant, here's a neat twist on the parable:
Six blind elephants were discussing what men were like.
After arguing they decided to find one and determine what
it was like by direct experience.
The first blind elephant felt the man and declared -
“Men are flat.”
After all the blind elephants felt the man, they agreed.