Tuesday, February 22, 2011

Data Archival and the JVP

It finally happened - the Journal of Vertebrate Paleontology has taken a few more tentative steps into the 21st century! In both an editorial in the most recent issue (note: full text is paywalled) and an updated version of the instructions to authors, the journal has announced a formal data archiving policy.

What does this mean?
Quoting from the JVP's new instructions to authors, "all data files needed to replicate phylogenetic or statistical analyses published in the journal should be made accessible via the JVP website as online supplementary material." In other words, if you analyzed numbers of any sort, you need to show your source data. This includes cladistic matrices (publication of these is already standard practice) as well as measurements or other data used in statistical analyses. Additional kinds of data - for instance, extraneous measurements unrelated to the study, raw field notes, or raw CT scans - are not included in this proposal (even if it's good scientific practice to make sure this information is available for posterity).

Why is this a good thing?
  • Data archival allows others to build upon previous work more easily. For instance, let's say I publish a statistical analysis of molar size in the early horses Mesohippus bairdi and Mesohippus westoni. Maybe there is another worker out there who wants to look at variation in some other Mesohippus species. If my dataset is available, it is much easier for another researcher to quickly advance beyond my work (assuming they trust my data, of course - see below).
  • Data archival allows new and unexpected uses for data (thus increasing citations). My p-values and arithmetic means of Mesohippus teeth are interesting, but not that useful outside the context of my paper. If I publish the raw data, though, other individuals can use these data (and cite my paper) in all sorts of other contexts. Maybe someone wants to throw the data into her study of horse tooth evolution (hey, it's another citation!). Maybe someone else is interested in Oligocene herbivore ecology as evidenced in molar properties (and there's another citation!).
  • Data archival ensures transparency. Everyone makes statistical or analytical mistakes. Unfortunately, these mistakes may render the results of a paper highly suspect at best, or worthless at worst. With the availability of raw data, it is much easier for someone to reproduce a study or correct misuse of statistics. (As a case study from my own work, I discovered that nearly all paleopathology studies in the literature were using incorrect statistical assumptions - and a reanalysis of the data forced some new interpretations!) Additionally, taxonomy frequently changes, meaning that the categories applied in an older analysis can become hopelessly outdated. Not so, if you can go back to the author's original data, make a few corrections, and rerun the analysis!
  • Publicly funded research deserves to be public. So much of paleontology research is funded by government grants, or conducted on company time. It is not a good use of our limited resources to keep data locked up after the original study has been published. This is somewhat analogous to writing an NSF grant to collect fossils for one's personal collection. Why should data be any different?
Answers to some common objections
  • "I have other plans for the data." Some researchers want a monopoly on their data. They have this fear in the back of their heads that someone is going to go out and do exactly the follow-up study that the original researcher had planned. I have several responses to this. First. . .really? Second, I would remind authors that it is bad science (perhaps even unethical) to publish research results that are not transparent to scrutiny. Third, I would remind authors that they are never obligated to publish all of the tangential data. If you are publishing a paper on dentary lengths in hadrosaurs, you don't have to release the data on predentary dimensions too! Finally, I would remind authors that this is just a lame excuse to put off their own follow-up research. We all know the stories of this or that researcher who has sat on a dataset for years. Science is not being helped by keeping those data secret.
  • "Interested researchers can just contact the authors." As an example of why this is a bad idea, please refer to the work of Leonard Radinsky. He published a number of wonderful morphometric studies of fossil mammals, clearly based on hundreds of measurements. But, he also passed away in 1985. Unless you have a Ouija board that actually works, it's highly unlikely that anyone will be able to exactly reproduce the results in his oft-cited "Ontogeny and phylogeny in horse evolution." Authors leave academia, pass away, or lose their data sheets all of the time. It's a pipe dream to assume that "data are available upon request." [to be fair to Radinsky, his paper did not indicate that the data were available - I just chose it as one prime example where the data are probably irrecoverable]
  • "It just encourages lazy research by data miners, because you should never trust anyone else's research data." There is a grain of truth in this - inter-observer error may creep into measurements, and maybe a certain author likes to measure plaster reconstructions. But once again, this is just a lame excuse for lazy research by the person who is objecting to data transparency! After all, if you can't trust the data, you can't trust the paper, so what's the point in publishing? It's a slippery slope. The benefits far outweigh the drawbacks.
  • "It's just more work for the authors." This too falls into the "lame excuse" category. If you've already gone to the trouble to put together an Excel spreadsheet for your statistical analysis, you can spend an extra 10 seconds transferring those data to the manuscript submission system. If it takes you longer than that, you may want to reconsider your data management practices.
Recommendations for JVP
I have just a handful of recommendations for the editors at JVP, based on my own experience as both a data user and a data generator. Some of these suggestions may already be incorporated, and others may be planned. Others may be impractical at this time. Either way, I think it is helpful to consider the following:
  • Make sure the data files are in a usable format. Historically, supplemental information at JVP has been launched as PDF files (with some NEXUS files). This is great for casual reading, but horrible for analysis. Just try copying 3,382 measurements from a PDF table into an Excel spreadsheet, and you'll see what I mean. This does not mean you need to choose a single format - why not have the data in PDF, Excel, and raw text? Multiple formats ensure maximum usability of the data across multiple platforms (as well as flexibility in the face of future software upgrades).
  • Consider a data embargo for reluctant authors. Many journals allow a six-month or (maximum) one-year embargo on supplemental data, to allow authors the chance to finish up any outside projects. Although I philosophically disagree with this option, I see its utility. It is also an appropriate compromise between protecting author rights and protecting scientific integrity.
  • Consider partnering with DRYAD or a similar data repository. A number of other evolutionary societies are doing this - why shouldn't SVP be a part of this?
  • Solicit society input. The members of SVP and the authors of JVP probably have some great thoughts on what they would like to see in data archival. Why not solicit input from the community to find out what the community needs? This will only solidify ownership of the data archival efforts by paleontologists!
  • Check out a recent publication on this very topic. Michael Whitlock recently published a great review article [paywall] on best practices in data archival - many of the points mentioned above are contained there. (Thanks to Randy Irmis for passing the link along.)
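On the "usable format" recommendation above, here is a minimal sketch of how little effort it takes to produce plain-text versions of a measurement table alongside a PDF or spreadsheet. It uses only Python's standard csv module. Everything in it is an assumption made for illustration: the column names, the specimen labels, and the numbers are invented placeholders, not measurements from any real study and not a format that JVP requires.

```python
import csv
import io

# Hypothetical placeholder rows, invented purely for illustration.
# These are NOT real measurements from any specimen or publication.
ROWS = [
    {"specimen": "example-001", "taxon": "Mesohippus bairdi", "m1_length_mm": 11.2},
    {"specimen": "example-002", "taxon": "Mesohippus westoni", "m1_length_mm": 10.4},
]


def to_text(rows, delimiter=","):
    """Serialize a list of dicts as delimited plain text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def read_table(text, delimiter=","):
    """Parse delimited text back into a list of dicts (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


if __name__ == "__main__":
    # Comma-separated version, readable by any spreadsheet or stats package.
    print(to_text(ROWS))
    # Tab-delimited version of the same table.
    print(to_text(ROWS, delimiter="\t"))
```

A reader can load either output directly into a spreadsheet or statistics package, with no retyping from a PDF table. Note that the values come back as text when read with read_table, so numeric columns need converting before analysis.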
The Last Word
All in all, I am pleased to see JVP take these steps. Congratulations to the editors of the journal, for taking this stand for good science!

More Reading
Berta, A., and Barrett, P. M. 2011. Editorial. Journal of Vertebrate Paleontology 31: 1. doi:10.1080/02724634.2011.546742 [paywall]

JVP Instructions to Authors [link to pdf]

Whitlock, M. C. 2011. Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26: 61-65. doi:10.1016/j.tree.2010.11.006. [paywall]