Tuesday, February 22, 2011

Data Archival and the JVP

It finally happened - the Journal of Vertebrate Paleontology has taken a few more tentative steps into the 21st century! In both an editorial in the most recent issue (note: full text is paywalled) and an updated version of the instructions to authors, the journal has announced a formal data archiving policy.

What does this mean?
Quoting from the JVP's new instructions to authors, "all data files needed to replicate phylogenetic or statistical analyses published in the journal should be made accessible via the JVP website as online supplementary material." In other words, if you analyzed numbers of any sort, you need to show your source data. This includes cladistic matrices (publication of these is already standard practice) as well as measurements or other data used in statistical analyses. Additional kinds of data - for instance, extraneous measurements unrelated to the study, raw field notes, or raw CT scans - are not included in this proposal (even if it's good scientific practice to make sure this information is available for posterity).

Why is this a good thing?
  • Data archival allows others to build upon previous work more easily. For instance, let's say I publish a statistical analysis of molar size in the early horses Mesohippus bairdi and Mesohippus westoni. Maybe there is another worker out there who wants to look at variation in some other Mesohippus species. If my dataset is available, it is much easier for another researcher to quickly advance beyond my work (assuming they trust my data, of course - see below).
  • Data archival allows new and unexpected uses for data (thus increasing citations). My p-values and arithmetic means of Mesohippus teeth are interesting, but not that useful outside the context of my paper. If I publish the raw data, though, other individuals can use these data (and cite my paper) in all other sorts of contexts. Maybe someone wants to throw the data in her study of horse tooth evolution (hey, it's another citation!). Maybe someone else is interested in Oligocene herbivore ecology as evidenced in molar properties (and there's another citation!).
  • Data archival ensures transparency. Everyone makes statistical or analytical mistakes. Unfortunately, these mistakes may render the results of a paper highly suspect at best, or worthless at worst. With the availability of raw data, it is much easier for someone to reproduce a study or correct misuse of statistics. (As a case study from my own work, I discovered that nearly all paleopathology studies in the literature were using incorrect statistical assumptions - and a reanalysis of the data forced some new interpretations!) Additionally, taxonomy frequently changes, meaning that the categories applied in an older analysis can become hopelessly outdated. Not so, if you can go back to the author's original data, make a few corrections, and rerun the analysis!
  • Publicly funded research deserves to be public. So much of paleontology research is funded by government grants, or conducted on company time. It is not a good use of our limited resources to keep data locked up after the original study has been published. This is somewhat analogous to writing an NSF grant to collect fossils for one's personal collection. Why should data be any different?
Answers to some common objections
  • "I have other plans for the data." Some researchers want a monopoly on their data. They have this fear in the back of their heads that someone is going to go out and do exactly the follow-up study that the original researcher had planned. I have several responses to this. First. . .really? Second, I would remind authors that it is bad science (perhaps even unethical) to publish research results that are not transparent to scrutiny. Third, I would remind authors that they are never obligated to publish all of the tangential data. If you are publishing a paper on dentary lengths in hadrosaurs, you don't have to release the data on predentary dimensions too! Finally, I would remind authors that this is just a lame excuse to put off their own follow-up research. We all know the stories of this or that researcher who has sat on a dataset for years. Science is not being helped by keeping those data secret.
  • "Interested researchers can just contact the authors." As an example of why this is a bad idea, please refer to the work of Leonard Radinsky. He published a number of wonderful morphometric studies of fossil mammals, clearly based on hundreds of measurements. But, he also passed away in 1985. Unless you have a Ouija board that actually works, it's highly unlikely that anyone will be able to exactly reproduce the results in his oft-cited "Ontogeny and phylogeny in horse evolution." Authors leave academia, pass away, or lose their data sheets all of the time. It's a pipe dream to assume that "data are available upon request." [to be fair to Radinsky, his paper did not indicate that the data were available - I just chose it as one prime example where the data are probably irrecoverable]
  • "It just encourages lazy research by data miners, because you should never trust anyone else's research data." There is a grain of truth in this - inter-observer error may creep into measurements, and maybe a certain author likes to measure plaster reconstructions. But once again, this is just a lame excuse for lazy research by the person who is objecting to data transparency! After all, if you can't trust the data, you can't trust the paper, so what's the point in publishing? It's a slippery slope. The benefits far outweigh the drawbacks.
  • "It's just more work for the authors." This too falls into the "lame excuse" category. If you've already gone to the trouble to put together an Excel spreadsheet for your statistical analysis, you can spend an extra 10 seconds transferring those data to the manuscript submission system. If it takes you longer than that, you may want to reconsider your data management practices.
Recommendations for JVP
I have just a handful of recommendations for the editors at JVP, based on my own experience as both a data user and a data generator. Some of these suggestions may already be incorporated, and others may be planned. Others may be impractical at this time. Either way, I think it is helpful to consider the following:
  • Make sure the data files are in a usable format. Historically, supplemental information at JVP has been posted as PDF files (with some NEXUS files). This is great for casual reading, but horrible for analysis. Just try copying 3,382 measurements from a PDF table into an Excel spreadsheet, and you'll see what I mean. This does not mean you need to choose a single format - why not have the data in PDF, Excel, and raw text? Multiple formats ensure maximum usability of the data across multiple platforms (as well as flexibility in the face of future software upgrades).
  • Consider a data embargo for reluctant authors. Many journals allow a six-month or (at most) one-year embargo on supplemental data, to allow authors the chance to finish up any outside projects. Although I philosophically disagree with this option, I see its utility. It is also an appropriate compromise between protecting author rights and protecting scientific integrity.
  • Consider partnering with DRYAD or a similar data repository. A number of other evolutionary societies are doing this - why shouldn't SVP be a part of this?
  • Solicit society input. The members of SVP and the authors of JVP probably have some great thoughts on what they would like to see in data archival. Why not solicit input from the community to find out what the community needs? This will only solidify ownership of the data archival efforts by paleontologists!
  • Check out a recent publication on this very topic. Michael Whitlock recently published a great review article [paywall] on best practices in data archival - many of the points mentioned above are contained there. (thanks to Randy Irmis for passing the link along)
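
On the usable-format recommendation above: here is a minimal sketch, in Python, of how one table of measurements could be written out as both comma-separated and tab-delimited plain text, so that nobody is left copying numbers out of a PDF. The specimen IDs, column names, and values are invented placeholders (not real data), the helper names are hypothetical, and nothing here reflects how JVP's submission system actually works.

```python
import csv
import io

# Invented placeholder rows for illustration only (NOT real measurements).
ROWS = [
    {"specimen": "EX-001", "taxon": "Mesohippus bairdi", "m1_length_mm": 12.1},
    {"specimen": "EX-002", "taxon": "Mesohippus bairdi", "m1_length_mm": 11.8},
    {"specimen": "EX-003", "taxon": "Mesohippus westoni", "m1_length_mm": 10.9},
]


def to_delimited(rows, delimiter=","):
    """Serialize a list of dicts as delimited plain text with a header row."""
    buffer = io.StringIO()
    writer = csv.DictWriter(
        buffer,
        fieldnames=list(rows[0].keys()),
        delimiter=delimiter,
        lineterminator="\n",
    )
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


def from_delimited(text, delimiter=","):
    """Read delimited text back into a list of dicts (all values as strings)."""
    return list(csv.DictReader(io.StringIO(text), delimiter=delimiter))


# The same table in two plain-text flavors that most software can open.
csv_text = to_delimited(ROWS)
tsv_text = to_delimited(ROWS, delimiter="\t")

if __name__ == "__main__":
    print(csv_text)
```

Either string can be saved to a file and uploaded alongside the PDF. Plain text was chosen for this sketch because it needs no special software to open and should survive future software upgrades.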
The Last Word
All in all, I am pleased to see JVP take these steps. Congratulations to the editors of the journal, for taking this stand for good science!

More Reading
Berta, A., and Barrett, P. M. 2011. Editorial. Journal of Vertebrate Paleontology 31: 1. doi:10.1080/02724634.2011.546742 [paywall]

JVP Instructions to Authors [link to pdf]

Whitlock, M. C. 2011. Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26: 61-65. doi:10.1016/j.tree.2010.11.006. [paywall]

10 comments:

Ashley said...

Thank GOODNESS. Maybe by the time I hit grad school they'll have their stuff completely together!

Mike Keesey said...

"Make sure the data files are in a usable format."

Hear, hear! I am sick of seeing tables in PDFs.

Captain Fossil said...

Agree completely. This is a great step forward for JVP!

Ross Mounce said...

I too am overjoyed that this has happened!

I've been campaigning myself for a little while for this to happen. I've even given an entire talk about it at the Young Systematists' Forum (SystAss, UK): www.bit.ly/phylodata

However, this is but one journal. What we really need is *ALL* journals to do this: Zoo J Linn, Cladistics, PloS, APP, Palaeontology, Canadian J Earth Sci, JSP, Pal Zeit, Vert PalAsiatica, J Mamm Evol, Compt Rendus Palevol, Geobios... the list goes on, endlessly!

It's great that some of the big journals have done the right thing. It's high time the other journals followed suit. In this day and age, archiving data just isn't that difficult - I see few excuses NOT to!

Andy said...

Thanks for the link, Ross. I agree that even more journals need to get on board. . .I suspect it's largely a matter of authors making their needs known directly to the editors of the journals. Maybe someone should coordinate an email campaign?

Ross Mounce said...

An email campaign would be nice and if you wanted to start one, with a coalition of co-signatories(?), I'd be happy to help. But even if one targeted say Taxacom, DML, VRTPALEO, PalAss (UK), SystAss (UK)... - that still won't reach vast swathes of the 'data producing' community. I think perhaps we need more published-papers (e.g. your Nature correspondence http://dx.doi.org/10.1038/4611053b) and conference talks.

I'm planning to give a follow-up talk to my YSF12 one at the Systematics Association biennial meeting this year. Might well raise the issue with Willi Hennig Society (Cladistics) people too, although less formally. Hopefully this should raise awareness of the issues to a wider, broader audience.

[To everyone:] Keep up the good work! Things *are* changing for the better :)

Mike Taylor said...

Great news indeed! I am delighted to see this important step forward.

A wonderful addition would be deposition of full-resolution figures as submitted for publication -- there's nothing more frustrating than submitting a 600 dpi figure, then seeing it squished down to 100 dpi in the PDF.

Also: *headdesk* on the editorial being paywalled.

Eric Morschhauser said...

I will second Mike Taylor's frustration. Formatting images to PDF is a good start, but a bad finish. They currently require all supplementary files to begin as Word files (awful for images) and to conform to the print journal's formatting. Since we're talking about data that will never be archived as printed material, why not relax the requirements so images can be seen at full size and in a selection of recognizable file formats? File size shouldn't be an issue considering how inexpensive data storage has become.

That being said, the editors do deserve congratulations for taking the journal in the correct direction.

Martin Brazeau said...

There are too many office cabinets out there to which one could mount a highly productive expedition. The arguments against "I have other plans for the data" apply just as well (or more appropriately) to the fossil materials themselves. I think this needs to be augmented to include figured material in institutional or public trust. I'm sorry, but type and figured material in public/institutional collections are in the public domain and should be accessible to all researchers.

It's time to put an end to the 20-year loan.

Ross Mounce said...

Just in case anyone hasn't seen it yet - I finally got around to drumming up some support for data archiving in the wider paleo-community: http://supportpalaeodataarchiving.co.uk/

>125 signatures of support so far!
add yours if you haven't already...