Tag Archives: bioinformatics

Where am I? Or, adding a counter to a running R process

A quick post in response to a question that came up on Facebook via Liz Freedman and Nic Campione: how do you add a counter/ticker to a running process in R? This is something that can be really helpful if you’re still in the process of trying to debug some code, as a process that involves thousands of iterations may be either stuck and broken, or it could be just taking its sweet time to finish, and without any output during the iterations, you can’t know which it’s doing. So, in one line, here’s a simple way of adding a counter to a for loop (although while and other style of loops follow the same logic):

for (i in 1:1000) {
  if (i%%100==0) print(paste("This is iteration number ", i))
  #all your complex code can go down here, whatever it is
}

In this case, I just added a rule that printed the iteration number every 100 loops (the code can be run directly in a terminal if you want to see what I mean). The two percent signs right beside each other are the modulo operation, so that when the remainder of the iteration is zero (that is, every 100 loops), it prints out a message. The number can be changed easily so that it is every 1000 instead of 100, and you can also use different rules, but I find that this is a relatively quick and easy way to add a counter.

As for making the whole process run faster, using something like the foreach package (and associated packages like doMC or doParallel) can significantly speed things up if you have a multicore processor on your computer.

TNT and Ubuntu

TNT (or ‘Tree analysis using New Technology’) is a relatively fast parsimony-based phylogenetics program. However, to be quite frank, it is also totally confusing. It took me some time, but I have put together a script that does pretty much everything I want it to do now, so I can just reuse the script every time I need to do something. I’m posting it here, in the hopes that other people will not have to spend so long figuring this all out. Just note though, the actual commands are in this type of font.Of course, everything that follows is what I’ve figured out through reading papers and trawling the internet, so suggestions are appreciated.

First things first, make sure your data is in the correct format. Theoretically, TNT can handle .nex files, but then again, my old car could theoretically do 220 km/h because the speedometer went that high. Your best bet is to just take your .nex file and reformat it to .tnt format, which pretty much involves deleting everything except the actual matrix, and then editing the first few lines. The first line should read ‘xread’, with the next line containing the number of characters, a space, and the number of taxa. (You can add a line between those for a comment, just be sure to put the comment in quotes)

For example, a simple data set might look like this:


xread
‘sample phylogeny’
5 3
taxonA 00011
taxonB 01111
taxonC 11111
;

While TNT is not the most user-friendly program, using scripts can help a lot, as you often will run the same set of analyses on different data sets. For my purposes, most of the time I want to run a basic tree search to find all of the most parsimonious trees, find a consensus tree from that (if necessary), map the characters, get bootstrap values and finally find the Bremer supports. At the bottom of the page I’ve posted my default TNT script that I use. Note that it does include comments (lines with a # in front) that will need to be deleted for the script to run properly. Also, replace ‘samplephylo’ with whatever you’ve named your .tnt file. Finally, when you’ve done all that, you need to move both the script and the .tnt file into the folder that contains all the files for the TNT program (on mine, the folder is called ‘tnt64-no-tax-limit’).

Finally, once you’ve done all that, you can run TNT (usually by navigating in a terminal to that same ‘tnt64-no-tax-limit’ using the change directory command [i.e. cd]), and then running TNT by typing something like ‘sudo ./tnt’. Note that TNT typically doesn’t run properly unless you run it as root/administrator (same applies to Mac OS X, except you’d run it as ‘sudo ./tnt.command’).

Often, TNT will throw error messages for no apparent reason. A few things to make sure of is that all the file names are letters or numbers only (no punctuation, including periods or underscores, and no spaces), that both the command script and .tnt file are in the same folder as the TNT program and that the program is being run as root/administrator. Hopefully by putting this up here, more people will be able to finally figure out how to use TNT.


#take everything after here and put it in a file in the tnt folder
#delete all the lines that start with the # sign
#replace the ‘samplephylo’ with the name of your file. I recommend converting
#your matrix to .tnt format, which is pretty simple
#check out the example.tnt file in the tnt folder to see how it’s done
#the most imprtant part is the first few lines,
#but be sure to convert any polymorphisms to [] brackets
#load the data file
procedure samplephylo.tnt;
#write all the output to this file
log samplephylo.out;
#do a basic run – finds a bunch of parsimonious trees quickly
mult ;
#this should find all the MPRs
bbreak=tbr;
#find the consensus tree
nelsen * ;
#map the synapomorphies on the tree
apo [ ;
#export the MPRs and the consensus tree to a file
export samplephylo.tre;
#do a bootstrap analysis – might need to up the runs
resample ;
#the stuff below does a Bremer analysis
#it’s a bit convoluted, but is the only way to get
#TNT to do a Bremer, because TNT isn’t really meant to
#do a Bremer analysis
hold 1000; sub 1 ; bbreak=tbr;
hold 2000; sub 2 ; bbreak=tbr;
hold 3000; sub 3 ; bbreak=tbr;
hold 4000; sub 4 ; bbreak=tbr;
hold 5000; sub 5 ; bbreak=tbr;
hold 6000; sub 6 ; bbreak=tbr;
hold 7000; sub 7 ; bbreak=tbr;
hold 8000; sub 8 ; bbreak=tbr;
#calculate the Bremer support from the suboptimal trees
bsupport ;
#quit the program – you can always do this manually
quit

MrBayes and Ubuntu

A quick tutorial on how to install MrBayes on an Ubuntu system.

Step 1. Open Software Centre

Step 2. Search for ‘mrbayes’ and hit enter

Step 3. Click ‘Install’

Or, alternatively, open a terminal and type ‘sudo apt-get install mrbayes’. And to run MrBayes, open up a terminal and type ‘mb’. That’s it. I just spent 15 minutes downloading and compiling the software from source, when it was already done for me. Oh well.

ape and geiger in R

Just a quick tip if you want to use the geiger package in R on an Ubuntu system, there are a few things you need to make sure you do. First, make sure you’ve installed the gfortran package and lapack-dev package in order to build the source packages. As well, you may need to reinstall the r-base package after you’ve done this. I was getting an error message like this:

/usr/lib/liblapack.so.3gf: undefined symbol: ATL_chemv

It seemed to go away after I reinstalled r-base and restarted the R session, although I’m not sure I needed to do both. Either way, it works now.

64-bit MrBayes

Just a quick note: if you’re running a 64-bit install of MrBayes in parallel, you will probably need to use this page to apply some patches to the source code:

http://technical.bestgrid.org/index.php/Bioinformatics_applications_at_University_of_Canterbury_HPC#Installing_MrBayes

I was having problems getting MrBayes to work on a workstation running 64-bit Ubuntu, but once I started fresh and first applied those patches, things seemed to work without throwing a ‘segmentation fault’ error during the ‘sump’ function.

R and fossils

I recently had a paper that was published in Palaeontologia Electronica on my ‘fossil’ package for R. The journal is open access, so anyone can read it for free.

Vavrek, Matthew J. 2011. fossil: palaeoecological and palaeogeographical analysis tools. Palaeontologia Electronica, 14:1T. http://palaeo-electronica.org/2011_1/238/index.html

MrBayes and multicore processors

Turns out setting up and using MrBayes on an Ubuntu system is much easier than I had thought. If all you want is the normal (serial) version of MrBayes, you can just download it from the repositories. However, if you want a serious speed up in the time it takes to get a good result, you can also run it in parallel on a multicore system (that is, pretty much any computer made in the last 4 years). To get it set up and running on Linux, I used some information I found in a forum post. To recap from there:

  1. Install the parallel libraries you need from the repositories. The package names I used were: mpich2, libmpich2-dev, and libmpich2-1.2, and libreadline6-dev.
  2. Download the source code file for MrBayes and unarchive it (on Ubuntu you can just right-click and select ‘Extract Here’
  3. Find the ‘Makefile’ in the source code and change the line that says ‘MPI ?= no’ so that it says ‘MPI = yes’
  4. Open a terminal, and navigate to the MrBayes folder (e.g. type in ‘cd /path/to/folder/mrbayes-3.1.2/’) and then make the package (type ‘Make’ at the prompt).  It might also be a good idea to change the file called ‘mb’ that is created to something like ‘mbpar’ so that you know it’s the parallel version. Also, I needed to make the file executable, so I typed ‘chmod +x mbpar’  to do that.
  5. Now, you’ll need to create a file in your home folder called ‘.mpd.conf’ with the line MPD_SECRETWORD=<secretword> in it. Change the <secretword> to something else though; it can be pretty much any word you like.
  6. ‘mpd &’ will launch the MPICH daemon, which needs to be running in order to handle communicating between the different cores.
  7. After all this, I was able to type ‘mpirun -np 2 /path/to/mrbayes/mbpar’ to run the parallel version of MrBayes in parallel on both cores of my dual core system. If you have more cores, you can always change the -np argument (e.g. to run it using 4 cores, type ‘mpirun -np 4 /path/to/mrbayes/mbpar’)

With the few tests I’ve done so far, I’ve seen about a 80% speed up by just using 2 cores instead of 1. It’s nice not to have to wait nearly so long to get my results. We’ll see what kind of time savings this could bring if I did it on an 8-core computer.

Biology and Computers

I recently ran across a nice summary of two papers in Science about the lack of understanding in biology of the way biological software, and computational biology in general, operate. The Science article authors opine about the lack of knowledge in biological labs of how the computational tools — gene sequencers for example — actually operate. I would tend to argue that the same could be said for almost any research lab today, with virtually any subject. I feel these researchers are perhaps longing for the mythic days of yore, when a man of knowledge did know much about everything, as there was arguably much less to know. I think every research lab out there probably is lacking in it’s knowledge of something, whether it’s the biologist who doesn’t understand his computational tools, or the computational biologist who knows too little about how actual organisms live and reproduce. So while I can understand why they worry, I wonder if this is just simply the way things are to be in a world of super-specialists.

Original articles:

Science, 2009. DOI: 10.1126/science.1173876
Science, 2009. DOI: 10.1126/science.1176016