Thursday, November 29, 2007

PAST in Linux

Talk about good timing. In my last post, I lamented the incompatibility of the latest versions of PAST with Linux. Just this morning, Alejo C. Scarano posted a tidy little work-around on the PAST users mailing list. Many thanks to Alejo for sharing this! Here's what to do:
  1. First, you need to get the latest version of WINE (0.9.49). For this, go to http://www.winehq.org/site/download and follow the instructions for your distribution.
  2. Install this latest version of WINE, following their directions.
  3. In the WINE configuration menu (winecfg), turn on "virtual desktop" under default settings for WINE, and then set PAST to run using the global settings. A more elegant way to do this is to call PAST from the command line, by using a command similar to:

    wine explorer /desktop=hl,1024x768 c:\whereeverpastis\past.exe
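
For what it's worth, on my Ubuntu machine the whole procedure boils down to a few terminal commands. Take this as a rough sketch rather than gospel - the install command and the path to past.exe will depend on your distribution and on where you put PAST:

    sudo apt-get install wine     # on Ubuntu/Debian; the WineHQ site has repositories with the newest version
    winecfg                       # turn on "virtual desktop" under the default (global) settings
    wine explorer /desktop=hl,1024x768 'c:\whereeverpastis\past.exe'

(The quotes around the path just keep the shell from eating the backslashes; the desktop name and resolution are arbitrary, so pick whatever suits your screen.)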

Friday, November 23, 2007

Statistics Software

Most any paleontologist will, at some time, have to delve into statistics in order to answer some sort of question related to his or her research. Unfortunately, many of these statistical tests exceed the options available in Excel (I find it highly unlikely that Excel will ever have a principal components analysis, for example). So, what's a researcher to do? In this post, I'll address some of the statistical packages available out there as freeware or open source software.

  • PAST. This is probably the easiest-to-use statistical package out there, and it is geared especially toward paleontologists (as you might guess from the name, which is short for "PALaeontological STatistics"). You can run diversity indices, PCA, and a whole bunch of other methods. The interface is quite user-friendly, although it has occasional quirks in how it wants the data aligned in the columns. Bugs, once reported, are quickly ironed out, and new features are added relatively frequently. The statistical plots it produces are generally quite good, but there aren't a lot of options for customizing them. The website and the documentation are generally pretty good, although the documentation is a bit simplistic. Unfortunately, after version 1.56b, you can no longer run the software under WINE in Linux (but you can still download version 1.56b from the PAST site). Available for Windows.
  • R. The gold standard in statistical analysis--this is for people who are really serious about their data. One big plus with R is that it handles large data files without batting an eye - this was a lifesaver when I had FEM outputs with over 150,000 values. (To be fair, PAST loaded these too, although much more slowly and only with a lot of data massaging; SPSS choked.) R has a very active development community, and you can find packages to do just about anything. The big downside (for some users) is that R is command-line only (although front ends such as R Commander now allow access to some, but not all, of R's features via a graphical user interface). But it is incredibly powerful, and it is very easy to set up little scripts to run through whole masses of data in a matter of seconds (see the short sketch just after this list). The graphical outputs are highly customizable and easily exportable into widely used formats. The user's manual is among the best I've ever seen in open source software, too.
  • (S)MATR. This handy little program, available as a standalone executable for Windows (which also runs under WINE in Linux), an R package, or a MATLAB toolbox, will fill all of your reduced major axis regression needs. It's fast, powerful, and about the best way I've found to deal with data that don't meet the assumptions of Model I regression (a la Sokal and Rohlf). The downside is that it doesn't produce graphical plots - but it does all the statistical tests that PAST doesn't.
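
To give you a flavor of the scripting I mention in the R entry above, here is the sort of thing I mean. Treat it as a minimal sketch - the file name and column choices are made up for illustration - but it reads a big comma-delimited table, runs a principal components analysis, and finishes with the basic reduced major axis (Model II) slope that a package like (S)MATR wraps in proper tests and confidence intervals:

    # pca_sketch.R - a hypothetical example; change the file name and columns to suit your data
    dat <- read.csv("fem_output.csv")     # load a large table of measurements
    pca <- prcomp(dat, scale. = TRUE)     # principal components analysis on scaled variables
    print(summary(pca))                   # proportion of variance explained by each component

    # reduced major axis (Model II) slope for two of the columns
    x <- dat[, 1]
    y <- dat[, 2]
    rma.slope <- sign(cor(x, y)) * sd(y) / sd(x)
    cat("RMA slope:", rma.slope, "\n")

Save that as pca_sketch.R, run it with R CMD BATCH pca_sketch.R (or paste it into an interactive session), and the whole analysis comes back in seconds, no matter how many rows are in the table.
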
Any search out on the web will also uncover other statistical packages. The above are just the ones with which I am most familiar.

Wednesday, November 7, 2007

Open Source and Free Software: Pros and Cons

Just like commercial software, open source and free software has its pluses and minuses - for anyone who is new to the concept, it's important to be aware of all of these. A small list follows here.

Cost: In terms of dollars, open source couldn't be better. You pay nothing (unless you choose to donate to the projects) and get a piece of software that you're free to use and install on as many computers as you wish. No annoying anti-piracy dialogs, no serial numbers, nothing of the sort. In most cases, you can give copies of the software to your friends without any restrictions.

Timeliness: This is where open source and free software often shine. New versions of the Ubuntu Linux operating system are released every six months - compare this to two and a half years for new versions of Mac OS X or five years for Windows. For the statistical analysis system R, there might be four months between releases--and smaller updates and new extensions are added constantly. For many open source programs, bug fixes and new features are added nearly continually. That is, if you're lucky. Some projects lose steam or just plain die--just as happens with some commercial projects. It's worth doing a little investigation of an open source or free software project to see its update history and whether there is a prospect for long-term continuation of the software.

User support: This is a little more variable. Some open source programs--such as R--have excellent documentation, in the form of lengthy user's manuals and active support forums. Others have no formal user's manual and little or no user community. It completely depends on the software package--again, you'll probably want to do a little research.

Compatibility: Here is another area where you'll want to do some homework. For most purposes, many open source programs will read documents from their closed source cousins relatively easily. For instance, OpenOffice.org can open most Microsoft Office documents--unless the document has some really fancy formatting or odd macros. Export is also usually pretty reliable--and let me emphasize usually. But don't forget--many of these issues plague commercial software, too!

Required Geekiness: Some Linux distributions (basically, "flavors" of the operating system) practically require a degree in computer science to install and use them. Others, such as Ubuntu, are now at the point where a reasonably computer-illiterate person could use them with ease. Similar concerns apply to other programs. If you can use Microsoft Office, you can use OpenOffice.org. But it takes a decent bit of patience to get R working with your data (although it is worth noting that graphical interfaces for R are now out there).

Saturday, November 3, 2007

Why Do I Use Open Source Software For My Research?

This is one of those really tough, multifaceted questions. As this is intended as another introductory post for the blog, I'm going to start out rather broadly and then move to specifics.

First, some definitions. Most software with which you're probably familiar, such as Microsoft Office, the Windows or Macintosh operating systems, or Endnote, is "closed source." This means that the source code (the lines of programming that tell the program what to do) is not available to the general public. It also typically, but not always, translates into "commercial software." For a commercial software model, this makes sense--why give away your trade secrets?

Another broad category of software is "open source." This means that the source code is available for anyone to download, modify, rebuild, or improve. If you run the Firefox web browser, you're already using open source software. Linux (in its various forms) is an open source operating system. Open source software is usually free, but many companies (such as Novell and Red Hat) sell technical support. Not all free software is open source, though. A good paleontological example of this is PAST: because some of the software libraries used to build this free statistical analysis program are proprietary, its source code cannot be released.

For me, the descent into open source software started as a matter of necessity. A few years back, I started getting CT scans that I needed to analyze as part of my dissertation. I didn't really have a few thousand dollars floating around to buy one of the commercial packages (such as Amira), so I poked around on the internet and stumbled across the program called 3D Slicer. It turns out that this was perfect for what I needed to do (more on this in a later post)!

Fast forward a few years to this past March, when I bought a new desktop computer for my office. By this time, Windows Vista had come out, and it was the default operating system for most computers. The nature of my research--which includes lots of analysis of big CT scan datasets and finite element modeling--is extremely processor and memory intensive. The relatively massive memory requirements of the Vista interface, along with its new memory management strategy, did not mesh well with my computing needs (even after turning off unnecessary features). Thus, I turned to Linux and all of its associated free and open-source programs, and I haven't looked back.

Today, nearly all of my computer-based research time is spent in open source software. I write manuscripts in OpenOffice, browse in Firefox, manage my library in Zotero, do statistics in R, and analyze my CT scans in 3D Slicer. I also make use of some free (but not open source) programs such as PAST. I've retained a Windows installation on my hard drive, primarily because the commercial finite element analysis software I use is Windows-based (and there aren't any good open source alternatives that I have found to fit my needs yet). Otherwise, it's all Linux, all the time. Yes, that officially makes me a nerd.

So ultimately, then, why do I use open source and free software?
  • The price is right - free!
  • The software does the job I need it to do, and it is getting better all of the time.
  • Some features I have only found in open source or free software.
  • It provides a bit of an interesting challenge.
In the next series of posts, I am going to discuss the pros and cons of using open source and free software (lest you think I'm an anti-Microsoft, Linux fanboy). Stay tuned for more!

Thursday, November 1, 2007

So What's With the Title?

There's a great myth in paleontology these days--that you need technology, and expensive technology at that, in order to do quality research. This includes everything from CT scanning and computer reconstructions to statistical analysis.

Let's face it. Paleontologists are often uncreative when it comes to the use of technology. We use this protocol or that program because we saw someone else use it in an SVP talk. But what if you want to try something different? Do we always need commercial software for our research? Is that fancy-dancy analysis even necessary or useful?

Over the last few years, I've been on the slippery slide into open source software. It started when I wanted to analyze CT scan data without a $4,000 piece of software. The next thing I knew, I had Ubuntu Linux as my primary operating system and I was writing dissertation chapters in OpenOffice. This blog will review how these and other pieces of software may be of use for other folks in paleontology (or other sciences). I'll highlight some of the software I've been using, along with pros and cons.

Finally, I want to make this something other than just a technology blog. In the spirit of open source, I will discuss advice on graduate school, grants, and other topics, with the hope that this information will be useful for many people starting out in their education. And, because I'm a paleontologist, I'll probably discuss that topic a little bit, too.