Transect Layout in R

I’m working with an awesome group of folks at NOAA’s Northwest Fisheries Science Center on a project investigating biodiversity of eelgrass habitats around Puget Sound, and our team (the Kelly Lab) is assessing the feasibility of environmental DNA methods for this type of study. If we’re to be confident in our ability to detect differences among sites, we first need to know how far DNA molecules (and tissue fragments containing them) travel before degrading in these tidally influenced habitats.

To do this, I needed to set up some transects running away from a known DNA source along which we could collect water samples from a boat. I knew (1) where I wanted to start each transect, (2) what direction I wanted the transects to run, and (3) what distances along the transects I wanted to sample. I needed to get coordinates for sampling point along the transects–nothing too complicated, just some simple “SOH-CAH-TOA” with the twist of sampling on a round surface (the earth). As is often the case, I wrote some really sloppy code myself before stumbling on exactly the function I needed: destPoint in the very useful R package geosphere. It’s incredibly simple, and you can find some code demonstrating its usage over here on my GitHub page.



Legend for Haplotype Networks in R

When plotting a haplotype network in R with the package pegas, the default legend can be pretty awkward.  A user recently pointed out that at least in some cases, the colored legend circles are impossibly hard to see. Pegas does some porting of the generic plot function to plot haploNet objects, but the way in which it does so was not immediately obvious to me after a quick dig, and which does not permit passing of arguments to the function legend. I see two potential workarounds: (1) Suppress pegas’s plotting of a legend (legend=FALSE) and then plot a custom legend, or (2) plot bigger colored squares over the tiny dots given by the default plot. The second option was easier to rig up quickly, and avoids having to make sure the scale of the circle and line match up with the plot. To do this, I added the following lines to the existing code:

plot(net, size=attr(net, “freq”), scale.ratio = 3, cex = 0.5, labels=TRUE, pie = dc, font=2, fast=TRUE, legend=c(-100,90))

legend(-108,83, legend=rep(“”,11), fill=rainbow(11), bty=”n”, cex=1.2)

Which produced the following plot:


Interpreting Structure Plots

The software STRUCTURE is a useful tool for assessing population structure using genetic data from multiple loci. The typical way of visualizing its output is essentially a bar plot where each bar represents an individual, and the bars are filled by colors representing the likelihood of membership to each cluster. I think it usually takes looking at a couple of plots before one really “gets” what the plot is depicting, so in a presentation I think it is useful to walk the audience through the expected output under different scenarios. So, I wrote an R script to illustrate the expected output from structure where there is either high or low gene flow among clusters, for 1-4 clusters. Note that the plots do not represent the expectation of the output for the same data analyzed under each clustering scheme. You can find the code on my Github page here.


Puzzling BEAST error

If one were to trace back in time the sequence of a given stretch of DNA, that gene’s history wouldn’t necessarily perfectly represent the history of the lineages of individuals harboring it. Thus, if you want to know the true history of lineages of individuals, you should compare the lineages of multiple genes. One can accomplish this using the software package BEAST, using the atrociously named extension *BEAST (pronounced “star beast”). Until you can pronounce words into search engines, cleverly named programs like this will be a pain in the neck for users.

BEAST comes with a separate program, BEAUti, to generate the XML file it uses as input. After loading my alignments into BEAUti with no apparent problems, specifying a bunch of parameters, and exporting the XML file, I got this error when trying to run the analysis in BEAST:

java.lang.Exception: Could not find a proper state to initialise. Perhaps try another seed.

Amply vague, but after much gnashing of teeth, searching of documentation, and reinstalling unstable versions, I discovered that my alignments were the problem.

BEAST requires a species be represented by at least one sequence in the alignment of every locus, whether you have data or not. That is, if you have sequence data for 10 species for 20 loci, but are missing data for one species at one locus, you MUST include a dummy sequence for that species in the alignment of that locus. I think one part of my confusion somehow came from the fact that it does not matter to BEAST whether sequence data across loci comes from the same individual. So a sequence named similarly to other alignments but filled with N or ? will suffice.

Note also that the estimation of a species tree using a coalescent approach (as performed by BEAST) is premised on the fact that each species is represented by multiple individuals for multiple genes. Missing data should be problematic, though it is not clear to what degree, particularly with empirical data. Remember that the other side of the “missing” data coin is that additional but incomplete data may strengthen an analysis. In the case above, 19 loci for 10 species may not perform as robustly as 20 loci, even if the additional locus has not been sampled from one species. I always like to look on the bright side of things.

To be absolutely certain there’s no confusion: When using *BEAST, EACH NEXUS FILE NEEDS TO HAVE AT LEAST ONE REPRESENTATIVE FROM EACH SPECIES.

Combine heterozygous DNA sequences in R

Most folks know  the nucleotides that make up strands of DNA are represented by letters (A, G, C, T). But if you’re unsure of the nucleotide at a given spot, it can be represented by another letter that conveys your level of uncertainty. For example, “I know there’s a nucleotide here, but I have no idea what it is” is represented by an N. “This is either adenine or guanine (A or G)” can be represented by an R.

Like humans, most of the organisms we work with have DNA in the nucleus of their cells that comes in two copies, but in some places the sequences may not be identical. For example, if the strand inherited from your mom is ATG and the strand from your dad is ATA. In that case, if we want to use a single sequence in our analysis, we would code that piece as ATR.

I’ve run into this in with my own work on small scales, but a lab mate was having this issue with his data (restriction site associated DNA sequences generated from last-gen sequencing aka massively parallel sequencing on an Illumina machine). We didn’t try very hard, but couldn’t quickly find any functions already written in R or Python, so I wrote one that would do this for any pair of sequences of equal length. Note that because there are a large number of possible file types for storing DNA sequences (the number is proportional to the number of scientists who think they should reinvent the wheel in order to get more citations), I didn’t generalize it in any way. Thus, it’s up to you to get your sequences into R, and then use this function paired with one of the apply functions to find and consolidate the heterozygous sequences. You can find the code here.

Haplotype networks in R updated

I’ve gotten a good deal of feedback about the haplotype network script I posted (apparently nobody cares about pad thai around here!). A colleague was having trouble getting names to match up in her sites file and alignment. After reviewing the script, I noticed the cause of the problem: my self-taught hack programming skills. Perhaps something changed since I wrote it, but more likely, there was something specific about the sample and site names of the data I was working with that caused the script to work at that time. In any case, I’ve fixed it now, and uploaded working example files. I remain too lazy to match up names using the function match(), but will probably get around to that at some point. You can get the goods over at the resources page or here.

The offending line:


Should instead read:


Stranger Than Fiction

Something I am regularly appreciative of is the bottomless depth of weirdness in the biological world. While curiosities like the sexual relationships of anglerfish or Cordyceps’ conquest of insect brains may seem unusual, weird things are always right under our noses, because in biology, strange is the norm.

IMG_6953 copy Beetle mites

Not long ago, this carrion beetle flew into the porch light up at my house. I went in for a closer look, and noticed the back had an unusual bumpy texture. As I grabbed it, the bumps began to move and drop off, as you might imagine crawling skin to look. By the time I managed to get this photo, all but a few of the bumps had dropped off and started to crawl away, but I could tell they were mites. Very strange. I assumed them to be parasites, but a little Googling unearthed a much cooler story.

These little mites (genus Poecilochirus) are commonly found on carrion beetles, but are not only harmlessly hitching a ride, they actually help out the beetles. As you might have guessed, carrion beetles feed on and lay their eggs in the decaying carcasses of small animals, but so do other insects like flies. Beetles tend to be a little slower than flies when it comes to finding carcasses, leaving them at a competitive disadvantage: The fly larvae will not only consume more of the food, but the developing beetle eggs as well. But, when a beetle arrives that is carrying mites, the mites hop off and immediately start working the carcass over, looking for their primary food source, fly eggs and larvae. So, it turns out that the mites are mutualists that mediate the beetle’s tendency to arrive late to the carcass party. Things like this make it pretty easy to stay enthusiastic about biology…

Below: A carrion beetle with a large payload of mites. Photo by Ian Gethings

Haplotype Networks in R

haplotype_networkAnother grad student in the lab (Kim Tenggardjaja) has been making haplotype networks like the one shown above for some of her research. While she has been calculating haplotype frequencies as well as constructing and plotting the network using the package pegas in R, there was a step in the middle (assigning individual names/sampling locations to their haplotypes) which she was doing tediously by hand in Excel. Some of my recent bumbling in R has led me to see a faster workaround using the packages plyr and reshape. I’m sure there must be a better, easier, more streamlined way of doing this, probably built right into pegas, but this solution struck me as acceptable for now. The networks may come out looking a little wonky using only pegas; I suspect the best way to gussy them up in R would be using the package igraph. The R script I wrote can be found on the resources page and here.

New Habits

In this last stretch of grad school, it seems that new opportunities and projects pop up with increasing frequency, as does the urgency of each task. At the same time, I’m not willing to give up some of the things in my life that are not work; thus, my desire to be as efficient as possible in both capacities has intensified.

If I correctly understand the current culture of academia, if something doesn’t contribute to a line on your CV, it’s not important to your career (or at least, it’s not “work”). And increasingly, if it doesn’t contribute specifically to the publications section of your CV, it’s at best marginally important to job search committees. Thus, unless what you’re doing contributes at least in some small way to a publication, you are wasting your time as far as the pursuit of an academic career is concerned. My intent here is to clearly delineate “work” from “not work”, because I think more efficient people tend to be able to clearly make that distinction and appropriately prioritize their time. Part of the problem, for me at least, is that I love what I do, so work tends to be enjoyable. I also love playing guitar, but it’s clear that’s not work. Reading off-topic scientific papers gets into the gray area. Blogging is also gray area, though I like to think an online presence fosters one’s inclusion in the scientific community.

So, whenever I say “I am working on a project”, I have the manuscript file open. If what I’m doing isn’t at least indirectly contributing to a sentence in that manuscript, then I’m not working on that project. A nice aspect of this is that the manuscript can also function as a to-do list for the project. If I know I’m going to need to test that data fit certain criteria in order to perform a statistical test, I write those sentences, and any time I come back to a project, I know exactly what needs to be done. So far I’ve enjoyed this method, but it’s efficacy won’t be fully validated until I’ve finished grad school!

Morphological Mixup

sebae 8205779846_fbb9cfa458Above: Amphiprion sebae  in Java. From Flickr user Daniel Stoker.

For the vast majority of scientists, open access publishing and data sharing has become all the rage. Many scientific journals require that data such as DNA sequences be submitted to online data repositories for anyone to access. It cannot be emphasized enough how crucial this practice is to scientific progress. The recent leaps and bounds that have been made in resolving evolutionary processes would be impossible without these data resources. For example, a recent study relied on public data to demonstrate that the number of species in a group of organisms is not related to how long ago that group has been around, contrary to some expectations.

However, it’s important to remember that a lot of steps go into generating these DNA sequences, and each one is an opportunity for error to be introduced. Aside from ensuring correct locality data, ID numbers, and guarding against contamination during molecular labwork, a taxonomist must identify what species each specimen belongs to, which is not always such an easy task. Unless the DNA sequence is associated with a specific individual specimen (voucher) or a photo is available, that information is lost forever and subsequent users must rely solely on the identification assigned by whoever collected the tissue sample.

In working on the phylogenetics and population genetics of anemonefish, I’ve been lucky to find quite a bit of DNA sequence data available online. Occasionally, unexpected patterns arise, and ever the skeptic, I want to know whether that aberrance is evidence of a “real” biological pattern or human error. I came across such a pattern, and was able to resolve the issue thanks to the fact that the authors had posted photos of the voucher specimens along with their collection locales on the Marine Barcode of Life website. The photos perfectly match the species with which their DNA most closely resembles (Amphiprion clarkii), and not the given name (Amphiprion sebae).

My intent here is not to rag on these data, but to encourage the practice of providing voucher photos along with those data whenever possible. Public data should not in any way be written off; just taken with some scrutiny and a healthy dose of gratitude for the substantial amount of work behind them.

Below: Amphiprion clarkii in Java. From Flickr user divemecressi.5603328821_d9c487e03c