The Open Source Paleontologist: data


Tuesday, February 22, 2011

Data Archival and the JVP

It finally happened - the Journal of Vertebrate Paleontology has taken a few more tentative steps into the 21st century! In both an editorial in the most recent issue (note: full text is paywalled) and an updated version of the instructions to authors, the journal has announced a formal data archiving policy.

What does this mean?

Quoting from the JVP's new instructions to authors, "all data files needed to replicate phylogenetic or statistical analyses published in the journal should be made accessible via the JVP website as online supplementary material." In other words, if you analyzed numbers of any sort, you need to show your source data. This includes cladistic matrices (publication of these is already standard practice) as well as measurements or other data used in statistical analyses. Additional kinds of data - for instance, extraneous measurements unrelated to the study, raw field notes, or raw CT scans - are not included in this proposal (even if it's good scientific practice to make sure this information is available for posterity).

Why is this a good thing?

Data archival allows others to build upon previous work more easily. For instance, let's say I publish a statistical analysis of molar size in the early horses Mesohippus bairdi and Mesohippus westoni. Maybe there is another worker out there who wants to look at variation in some other Mesohippus species. If my dataset is available, it is much easier for another researcher to quickly advance beyond my work (assuming they trust my data, of course - see below).
Data archival allows new and unexpected uses for data (thus increasing citations). My p-values and arithmetic means of Mesohippus teeth are interesting, but not that useful outside the context of my paper. If I publish the raw data, though, other individuals can use these data (and cite my paper) in all other sorts of contexts. Maybe someone wants to throw the data in her study of horse tooth evolution (hey, it's another citation!). Maybe someone else is interested in Oligocene herbivore ecology as evidenced in molar properties (and there's another citation!).
Data archival ensures transparency. Everyone makes statistical or analytical mistakes. Unfortunately, these mistakes may render the results of a paper highly suspect at best, or worthless at worst. With the availability of raw data, it is much easier for someone to reproduce a study or correct misuse of statistics. (as a case study from my own work, I discovered that nearly all paleopathology studies in the literature were using incorrect statistical assumptions - and a reanalysis of the data forced some new interpretations!) Additionally, taxonomy frequently changes, meaning that previous categories applied in an analysis are hopelessly outdated. Not so, if you can go back to the author's original data, make a few corrections, and rerun the analysis!
Publicly funded research deserves to be public. So much of paleontology research is funded by government grants, or conducted on company time. It is not a good use of our limited resources to keep data locked up after the original study has been published. This is somewhat analogous to writing an NSF grant to collect fossils for one's personal collection. Why should data be any different?

Answers to some common objections

"I have other plans for the data." Some researchers want a monopoly on their data. They have this fear in the back of their head that someone is going to go out and do exactly the same next step study planned by the original researcher. I have several responses to this. First. . .really? Second, I would remind authors that it is bad science (perhaps even unethical) to publish research results that are not transparent to scrutiny. Third, I would remind authors that they are never obligated to publish all of the tangential data. If you are publishing a paper on dentary lengths in hadrosaurs, you don't have to release the data on predentary dimensions too! Finally, I would remind authors that this is just a lame excuse to put off their own follow-up research. We all know the stories of this or that researcher who has sat on a dataset for years. Science is not being helped by keeping those data secret.
"Interested researchers can just contact the authors." As an example of why this is a bad idea, please refer to the work of Leonard Radinsky. He published a number of wonderful morphometric studies of fossil mammals, clearly based on hundreds of measurements. But, he also passed away in 1985. Unless you have a Ouija board that actually works, it's highly unlikely that anyone will be able to exactly reproduce the results in his oft-cited "Ontogeny and phylogeny in horse evolution." Authors leave academia, pass away, or lose their data sheets all of the time. It's a pipe dream to assume that "data are available upon request." [to be fair to Radinsky, his paper did not indicate that the data were available - I just chose it as one prime example where the data are probably irrecoverable]
"It just encourages lazy research by data miners, because you should never trust anyone else's research data." There is a grain of truth in this - inter-observer error may creep into measurements, and maybe a certain author likes to measure plaster reconstructions. But once again, this is just a lame excuse for lazy research by the person who is objecting to data transparency! After all, if you can't trust the data, you can't trust the paper, so what's the point in publishing? It's a slippery slope. The benefits far outweigh the drawbacks.
"It's just more work for the authors." This too falls into the "lame excuse" category. If you've already gone to the trouble to put together an Excel spreadsheet for your statistical analysis, you can spend an extra 10 seconds transferring those data to the manuscript submission system. If it takes you longer than that, you may want to reconsider your data management practices.

Recommendations for JVP

I have just a handful of recommendations for the editors at JVP, based on my own experience as both a data user and a data generator. Some of these suggestions may already be incorporated, and others may be planned. Others may be impractical at this time. Either way, I think it is helpful to consider the following:

Make sure the data files are in a usable format. Historically, supplemental information at JVP has been launched as PDF files (with some NEXUS files). This is great for casual reading, but horrible for analysis. Just try copying 3,382 measurements from a PDF table into an Excel spreadsheet, and you'll see what I mean. This does not mean you need to choose a single format - why not have the data in PDF, Excel, and raw text? Multiple formats ensure maximum usability of the data across multiple platforms (as well as flexibility in the face of future software upgrades).
Consider a data embargo for reluctant authors. Many journals allow a six month or (maximum) one year embargo on supplemental data, to allow authors the chance to finish up any outside projects. Although I philosophically disagree with this option, I see its utility. And, it is an appropriate compromise between protecting author rights and protecting scientific integrity.
Consider partnering with DRYAD or a similar data repository. A number of other evolutionary societies are doing this - why shouldn't SVP be a part of this?
Solicit society input. The members of SVP and the authors of JVP probably have some great thoughts on what they would like to see in data archival. Why not solicit input from the community to find out what the community needs? This will only solidify ownership of the data archival efforts by paleontologists!
Check out a recent publication on this very topic. Michael Whitlock recently published a great review article [paywall] on best practices in data archival - many of the points mentioned above are contained there. (thanks to Randy Irmis for passing the link along)
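
To make the first recommendation above (usable formats) a bit more concrete, here is a minimal sketch of what a plain-text version of a measurement table could look like, using only Python's standard csv module. Everything in it is an assumption for illustration: the column names, the specimen numbers, and the values are made up, and nothing here reflects what JVP actually requires.

```python
import csv
import io

# Hypothetical column layout; not a JVP requirement.
FIELDS = ["specimen", "taxon", "element", "length_mm"]

# Made-up rows for illustration only; these are not real measurements.
rows = [
    {"specimen": "EX 0001", "taxon": "Mesohippus bairdi", "element": "m1", "length_mm": 11.2},
    {"specimen": "EX 0002", "taxon": "Mesohippus bairdi", "element": "m1", "length_mm": 10.8},
    {"specimen": "EX 0003", "taxon": "Mesohippus westoni", "element": "m1", "length_mm": 12.1},
]

def to_csv_text(rows, fieldnames):
    """Serialize a list of row dictionaries as CSV text with a header line."""
    buffer = io.StringIO(newline="")
    writer = csv.DictWriter(buffer, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

def from_csv_text(text):
    """Read CSV text back into a list of dictionaries (all values are strings)."""
    return list(csv.DictReader(io.StringIO(text, newline="")))

csv_text = to_csv_text(rows, FIELDS)
restored = from_csv_text(csv_text)

# Numbers come back as text, so convert before doing any statistics.
lengths = [float(row["length_mm"]) for row in restored]
print(csv_text)
print(lengths)
```

The point of the round trip is that a plain-text table can be read by a script (or pasted into a spreadsheet) without anyone retyping numbers from a PDF.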

The Last Word

All in all, I am pleased to see JVP take these steps. Congratulations to the editors of the journal, for taking this stand for good science!

More Reading

Berta, A., and Barrett, P. M. 2011. Editorial. Journal of Vertebrate Paleontology 31: 1. doi:10.1080/02724634.2011.546742 [paywall]

JVP Instructions to Authors [link to pdf]

Whitlock, M. C. 2011. Data archiving in ecology and evolution: best practices. Trends in Ecology & Evolution 26: 61-65. doi:10.1016/j.tree.2010.11.006. [paywall]

Wednesday, February 24, 2010

Two articles of note

First, congratulations to Chris Brochu and colleagues on their paper describing a new species of "horned" crocodile from Olduvai Gorge in Tanzania. This probably was an animal that preyed on our earliest human ancestors, as evidenced by a variety of hominid bones from the area with crocodile bite marks. Consequently, this animal has been given the name Crocodylus anthropophagus ("human-eating crocodile"). The paper is freely viewable at PLoS ONE, as a regular web page, XML file, or PDF, and you can also download high resolution versions of the figures. Have a question or comment on the paper? Head on over to the website and get yourself heard! (full disclosure: I am an editor at PLoS ONE)

The second paper of interest concerns the issue of data sharing, which I covered a few weeks back. Following up on a statement published in The American Naturalist, the editors of Evolution have issued their statement (institutional subscription or payment required, sorry) supporting mandatory archival of most data for papers published within the journal. This isn't a huge surprise (they were listed in the earlier article as a supporter), but it's still nice to see something in print. Thanks to Randy Irmis for the notice!

Citations
Brochu CA, Njau J, Blumenschine RJ, Densmore LD (2010) A new horned crocodile from the Plio-Pleistocene hominid sites at Olduvai Gorge, Tanzania. PLoS ONE 5(2): e9333. doi:10.1371/journal.pone.0009333

Rausher MD, McPeek MA, Moore AJ, Rieseberg L, Whitlock MC (2010) Data archiving. Evolution 64: 603-604. doi:10.1111/j.1558-5646.2009.00940.x

Thursday, February 4, 2010

The Decline of Documentation

I'm a huge fan of Matt Wedel's "Measure Your Damned Dinosaur" philosophy. For those of you who aren't familiar with his post on the topic (and seriously, it's probably one of the best pieces of research blogging from 2009), the title is pretty self-explanatory. Despite scads of new techniques, a bloatload of journal options, and the rise of endless supplementary data files, we paleontologists just ain't doing our job anymore when it comes to publishing measurements of specimens. As Wedel said,

"It blows my damn mind that a century ago people like Charles Whitney Gilmore and John Bell Hatcher could measure a dinosaur to within an inch of its life, and publish all of those measurements in their descriptions, and lots of folks did this and it was just part of being a competent scientist and doing your damn job. And here we are in the 21st century with CT machines, laser surface scanners, ion reflux pronabulators and the like, and using a narf-blappin’ TAPE MEASURE is apparently a lost art."

Just for giggles, I decided to find out if things really were better in the past, or if we're just waxing nostalgic for a golden age of documentation that never existed. Being someone who is number-inclined, I grabbed a bunch of ornithischian data from The Open Dinosaur Project. Using some handy-dandy spreadsheet functions, I extracted data for the year of publication for a series of measurements as well as the number of relevant limb bone measurements for that paper that made it into our database.

Then, it was time to run statistics! I wanted to see if there was a correlation between year of publication for a specimen's measurement and the number of measurements published for each specimen. So, I ran a non-parametric test of correlation (Spearman's rho, or ρ). Care to guess what I found?

Sadly, Wedel is right. There is a negative correlation between year of publication and number of measurements: ρ = -0.44, P < 0.0001.

So then I thought, there are a lot of papers that have just published a single measurement of an isolated bone, or a whole table of single element specimen measurements (e.g., femur length for 20 different species). Maybe that was biasing the dataset. Thus, I trimmed out all of the entries that had only one measurement. Still, there was a significant negative correlation (ρ = -0.27, P < 0.0001). The average paper published between 1920 and 1930 had 18.5 measurements; between 2000 and 2009, 14 measurements.
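
For anyone who wants to try this sort of test on their own data, here is a minimal sketch of Spearman's rho in plain Python: rank both variables (tied values share the mean of their ranks), then compute an ordinary Pearson correlation on the ranks. This is not the software or the data I used for the numbers above; the records below are invented purely to show the mechanics, including the step of trimming out single-measurement entries.

```python
from math import sqrt

def average_ranks(values):
    """Rank values from 1 to n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of tied values starting at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / (spread_x * spread_y)

def spearman_rho(x, y):
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))

# Invented (year of publication, number of measurements) pairs, for illustration only.
records = [(1905, 30), (1924, 22), (1950, 15), (1978, 1),
           (1991, 9), (2003, 12), (2008, 1)]

rho_all = spearman_rho([year for year, _ in records],
                       [count for _, count in records])

# Drop entries with only one measurement, as in the second analysis.
trimmed = [(year, count) for year, count in records if count > 1]
rho_trimmed = spearman_rho([year for year, _ in trimmed],
                           [count for _, count in trimmed])

print(rho_all, rho_trimmed)
```

Note that this sketch only gives the correlation coefficient. Getting a P value takes an extra step (a permutation test, or a statistics package), which I have left out here.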

Have our dinosaur skeletons gotten less complete? Or have we given in to the need to squeeze less information in less space, and perhaps a little laziness on the side? What will it take to change this trend? It's all food for thought.

Caveat: This is a highly unscientific, probably very non-random sample. Oh well.

Thursday, January 28, 2010

Where is paleontology?

Last week, many of the leading journals in evolutionary biology - including The American Naturalist, Molecular Ecology, Journal of Evolutionary Biology, Evolution, and a number of others - announced a data archiving policy. In short, this policy states that the data behind the results of a paper should be publicly archived in well-known repositories such as Data Dryad, GenBank, or TreeBASE. Do you notice anything missing in this illustrious list of publications?

Not a single one of those journals explicitly focuses on paleontology. Last time I checked, we paleontologists like to think of ourselves as evolutionary biologists. Time and time again, we lament how we're not allowed a place at "The High Table" of evolutionary thought, and how paleontology is viewed as largely irrelevant by the "people who matter." So why weren't any paleontology publications on this list? Will we see any on the list in the near future?

The article in The American Naturalist gives a good run-down of the arguments for sharing data, so I'll only briefly summarize them here:

It allows reproducibility of analyses.
It allows others to build upon your work more easily.
Papers that release their data may get cited more frequently.
The data will be lost to science otherwise.
It's the right thing to do.

And to counter some potential objections:

This would only request the release of data directly relevant to the study. Not your pages and pages of raw notes. Just that Excel spreadsheet that you already generated on your way to the analysis. Seriously. It's not a lot of extra work, if any.
This is not requesting the digitization and distribution of video, CT scan, or similarly large and unwieldy data (although that would be nice in the future).
No, it does not mandate the release of locality data, or similarly privileged information.
The policy does not require immediate release of the data, if there's a good reason (e.g., another pending publication) to delay. I'm not sure I entirely support this (if you're publishing the analysis, you should publish the data), but I understand it as a necessary compromise to get more individuals on board. I won't let the perfect be the enemy of the good.

Some of the most ground-breaking and high-profile work in paleontology is happening on account of large meta-analyses of data pulled together from the literature - largely thanks to efforts like the Paleobiology Database. This work has real implications for big questions facing our science and our world: Climate change. The pace of evolutionary radiations. The origins of modern biological diversity. These sorts of databases focus primarily on geographic, stratigraphic, and taxonomic data - but think how much more powerful they could be if all of the morphological data ever published were available! Or if the PBDB volunteers didn't always have to transcribe the information from a PDF file. And look at the great strides that molecular biology has made with the ready availability of sequence data on GenBank! This would not have happened with a mentality of data hoarding.

Look. Amateur hour is over. If we want to play in the big leagues, we have to start acting like a real science. Real science is reproducible. Real science is data-driven. Real science involves sharing data. Yes, I know it's hard. It's new. We haven't done things this way before. There are potential problems. Not everyone is adopting it quickly. But if we always wait five years to "see what happens," we paleontologists quite frankly don't deserve a place at the High Table. Let's be leaders, not followers.

References
Piwowar, H. A., R. S. Day, and D. B. Fridsma. (2007). Sharing detailed research data is associated with increased citation rate. PLoS ONE 2(3):e308, DOI: 10.1371/journal.pone.0000308.
Whitlock, M., McPeek, M., Rausher, M., Rieseberg, L., & Moore, A. (2010). Data archiving. The American Naturalist, 175 (2), 145-146 DOI: 10.1086/650340

For previous posts on data sharing in paleontology, see here and here. Want to get involved? Spread the word. Talk to your local journal editor. Let the people who count know what you think.

Tuesday, December 1, 2009

Data or Hypothesis?

Skull of modern human (Homo sapiens), reconstructed from CT scan of original specimen using 3D Slicer 3.4. Data from OUVC 10503, downloadable at the WitmerLab web page.

As the headline says: data or hypothesis? Discuss.

Wednesday, November 19, 2008

New On the Web

Ok, I've not been so good on posts lately. But in the spirit of fixing that. . .here are two fantastic links in the vein of open science.

First of these is the Witmer Lab's 3D Visualization Page. As most of you paleo types know, Larry Witmer and his students have been doing a hefty bit of CT scanning these days. Now, they've begun to put some selected data sets up for downloading, in the original DICOM format. I played with the Ankylosaurus and Pinacosaurus data a bit today. . .great quality scans!

The second is the Royal Tyrrell Museum of Paleontology's searchable collections database. This one had flown under my radar for a while, because it's not actually linked from the museum's web page. I only found out about it by a brief mention in Patty Ralrick and Darren Tanke's new paper in the Pachyrhinosaurus monograph.

That's all for now. . .upcoming posts will finish the goat head FEM series, and also talk about the rates at which supplementary data are actually accessible.

Tuesday, May 20, 2008

Data and the Open Source Paleontologist 2

The previous post on this topic outlined some resources for the posting and dissemination of primary paleontological data on the internet. In this post, I'll take a look at why more people don't do so, and what we can do about it.

Why Aren't More Data Posted?
Myriad factors contribute to this issue - some of them are genuine roadblocks, and others are simply opportunities to change attitudes and common practice.

Laziness
Sometimes it's a lot of work to get your data posted online. You may have to reformat everything, or re-enter the data, or engage in digital gymnastics that take longer than the research itself took. In other cases, it's just one more thing to do on an already crowded research schedule. How to counter this? Perhaps my best suggestion is better awareness of the importance of these data being available - if people demand it, it will be viewed as an item of high importance, just as needed as the peer-reviewed publication itself. Some repositories, such as MorphoBank, also allow you to enter the data as you collect them, rather than doing the whole thing at the very end. This might also be a good talisman against the rush to upload a whole bunch of data files at the end of a program.

Museum Policies
In the case of posting photographs of specimens, many museums have policies that are unclear or seem to prohibit general dissemination of photographs. These policies are in place for good reason in some cases - this discourages commercial concerns from profiting off of images of specimens without a museum's knowledge. Although it's my understanding that most museums don't have a problem with posting things into scientific databases, it's probably best to check. Does anyone out there have experience with this issue?

Priority of Publication
If your data are online, this means other people have access. This can lead to productive collaborations - or, it could potentially lead to being "scooped." Here, the safest thing is to delay uploading of data until after the major resulting publication. The important thing is to get those data out there! And, if you use data from an online database, you have a responsibility to credit the person who did the primary work. Anything less just isn't very nice. There are always going to be people who are stingy with sharing already-published data, even when it isn't warranted (or in the case of CT scan data, even when the museum requests that a publicly-available copy be reposited with the institution!). The most important thing is to work to change attitudes and foster a culture of openness. Recent events in paleontology have perhaps made this a little more difficult, but I like to think that things will work out in the long run.

What Can Be Done?
Above, I've outlined a few solutions to some of the problems. In addition to the suggestions given above (some of which are more practical than others), I think we really need more databases. And more encouragement to use these new (and existing) databases. Gene squeezers have GenBank, but why aren't there more Paleobiology Databases out there? Advisors - make your students reposit their data online. Students - get your data out there, even if your advisors don't encourage it! And paleontologists in general - welcome to the 21st century! I hope that time and a new generation of tech-savvy paleontologists will change all of this for the better.

Thursday, April 17, 2008

Data and the Open Source Paleontologist

Paleontological research generates data, and lots of it--photographs, measurements, CT scans, character matrices, etc. Data are the cornerstone of most good papers, yet for reasons of space and journal style, they often never make it into print (or are relegated to that zone of "online supplementary information"). This is a Catch-22, because anyone hoping to evaluate, reproduce, or build upon your work needs these data.

With the growth of digital media and the internet, things are beginning to change. It's now much easier for paleontologists (and other scientists) to make available primary data - if they choose to do so. This post surveys a few on-line data repositories that are out there, and looks to the future. I'm going to focus on those that are most relevant to paleo types (sorry, no GenBank).

MorphoBank
MorphoBank is an on-line data editor and repository for cladistic data matrices. Registration is required to start your own file, after which you can upload images and data matrices. The image upload is particularly nice, because it allows you to link a character coding in a taxon to an image of a particular specimen. This means that someone trying to figure out your character states actually has a prayer of understanding what is meant by "mastoid process elongate (0) or fungiform (1)." The only real downside is that, at present, there doesn't seem to be support for uploading large CT datasets.

Dryad
This is a relatively new site, intended to archive the basic data underlying publications in evolutionary biology. A number of partner journals have signed on (e.g., Evolution and Systematic Biology), but unfortunately no paleo journals are there yet. One set of paleo data is available on Dryad, related to the Xenoposeidon type specimen. Kudos to Mike Taylor and Darren Naish for that! Data that could be archived here include photos, data matrices, measurements, and other media. Because the site is in such an early stage, the amount of available data and the search functions are currently rather limited.

Biomesh.org
This is another relatively recent website for which I have high hopes. The site focuses on finite element modeling in vertebrate biology, with background information and material properties databases. Of even greater interest is an area where published FE models can be downloaded for others to try out. It would be really, really nice if more researchers went ahead and put their models out there!

DigiMorph
This is one of the earliest data archives out there, focusing on CT scan data. Interested users can download movies of 3D reconstructions or slice sequences, download surface models (usually STL format), or read more about the scans. Unfortunately for most specimens, there is no way to download the actual data - so if you want to analyze some part of a specimen, you're out of luck (unless you contact the DigiMorph folks directly and have them mail you a DVD). I had high hopes for the UTCT Data Archive, which did post TIFF and JPG stacks of images. But, this effort seems to have lost its wind, and very few datasets have actually been posted. Regardless, DigiMorph has done an admirable job of getting at least the basic CT data out there for a number of publications.

MorphologyNet
This is another new website, appearing in just the last few months. The basic goal is to make available 3D reconstructions generated from serial section data (whether CT or "old-fashioned" thin sections), in an environment where you can rotate and examine the specimens. Because it's in early stages, content is mostly limited to frog specimens (but how cool they are!). All files are in OBJ format, for which a Windows and Mac viewer is provided (I had no problem getting it to run in Wine, once I turned off virtual desktop). For objects with multiple parts (for instance, a frog head with bone and brain segmented separately), you can change colors on certain pieces or make them transparent. It's a nifty little toy for viewing morphology in 3D. The only downside is that the software features are pretty limited (turn part on or off, change color, change transparency), and you can't take measurements of any sort. Also, the raw data from which the reconstructions were generated aren't available. But, it's another great way to get 3D morphological information out there!

Paleobiology Database
This database brings together faunal, floral, and stratigraphic data from a variety of published and some unpublished sources. It's a fantastic resource for looking at patterns of distribution, extinction, and diversification. Detailed morphological data (beyond body size or tooth measurements) and images are pretty much absent, because they are beyond the primary scope of the database.

Coming Next. . .
As you can see, a number of resources (and these are just a few highlights) are available already. But, a casual user will notice that paleontological data are pretty scarce on many sites, and data capable of further analysis are even more scarce. In the next post, I'll examine the reasons behind this, and what we can do about it.