I’m trying to keep most of my genome stats in bed format. This makes it pretty easy to mix and match analyses. However, one task that seems to come up over and over is how to sort the files when they have a header. So I don’t have to remember it and you don’t have to spend anymore time googling it the easy ‘linux’ way to do this is with the following one-liner:
If you attended Evolution 2013, you probably heard quite a lot of chatter about ultra conserved elements. Essentially, ultra conserved elements (UCEs) are parts of the genome that are highly conserved between different species. Although UCEs carry little phylogenetic information, they are surrounded by increasingly variable flanking sequence (see figure). When combined with their flanking sequence these ‘UCE loci’ make ideal markers to study evolutionary relationships across variable time scales. For example, we have used UCEs identified in birds and reptiles to identify homologous UCE loci in amphibians, birds and reptiles. We have also identified these same UCEs in many published mammal genomes.
An astute reader may wonder how it is possible to obtain UCE loci from taxa without sequenced genomes. The trick, it turns out, lies in the highly conserved UCE sequence. Because of their extreme conservation, it is possible to design complementary biotinylated probes to UCE sequences. These probes can then be used to enrich genomic DNA with a modified sequence capture approach that does not overly shear the genomic DNA thereby preserving the informative flanking sequence. The enriched DNA is then sequenced on a next-generation sequencer (e.g., an Illumina HiSeq) and the ultra-conserved regions are de novo assembled. With current protocols, this can yield close to 5000 unique UCE loci in most amniotes. These loci can then be used to resolve phylogenies from non-model taxa at an unprecedented genomic resolution.
Travis Glenn, Brant Faircloth, and I have been working on this method since 2007, first doing the alignments and identifying the UCE loci. We brought in Kevin Winker, Patty Gowaty, then Rob Brumfield and John McCormack because of our shared interests in avian phylogenomics as well as additional systematics chops. My primary role has been to develop the computational pipeline for inferring individual gene trees from each UCE locus and converting these into a species tree. However, I’m now a Post Doctoral Fellow at the California Academy of Sciences and in collaboration with Brian Simison and Anna Sellas we’re starting to sequence UCEs from a number of taxa. This means I need to run the whole protocol from start to finish, and so I thought I might write a few blog posts that may serve as a guide for other lab groups interested in sequencing and analyzing UCE loci.
We just received our first batch of probes so my next post will be about ordering the necessary materials and tools.
I recently had to design qPCR primers for some genes. I had a genome and an annotated GTF file derived from Cufflinks. Since I wanted the primers to span introns, to prevent the amplification of genomic DNA, I needed a both fasta file of coding sequence to use as input in to Primer3 as well as some associated information about where the introns were spliced out so I could ensure the primers I design spanned introns. So, what I did was use the cufflinks gffread command to convert the GTF file to create set of fasta transcripts annotated with the positions of each of the exons.
Here’s how to do it yourself. If you don’t have cufflinks installed you can download precompiled binaries from here, or if you’re using OS X you can use homebrew + the science tap to install it.
Once you have cufflinks installed type gffread —h to make sure everything is copacetic. Assuming that prints a bunch documentation to the terminal you can then run:
You’ll need to know how your gene or transcript is labeled in the GTF file (e.g. ‘GeneID’). The ‘-x’ flag ensures that the results only include coding sequence, the ‘-W’ adds the exon coordinates to the fasta header. Then run head GeneID.fasta to check that sequences were added to the output file.
One could probably further automate the next steps, but basically you just need to copy the first couple of exons/segs from the file and add brackets around the exon splice site you want the primers to span. You can use the ‘segs’ in the fasta header to determine where the splice junctions are.
The SAM specification now requires @RG tags to be included in all SAM/BAM alignments. If you are using GATK you have probably noticed that it will not run without them. Since @RG tags weren’t standard until recently, I’ve written a script to add them in post hoc. You’ll need to install pysam and python2.7 to get it to work.
I been working with a lot of very large files and it has become increasing obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.
The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines are then written to an output file or in this example just printed to the screen. Normally this would be tricky code to write, but python 2.7’s wonderful multiprocessing module handles all the synchronization for you.
Bowtie2 is a short read aligner that is optimized for aligning longer reads of lengths of 50 bp or greater. I’ve been playing around with it and was initially puzzled by the fact that it only outputs SAM formated alignments. Then I realized you can pipe the output straight into samtools which will do the compression to BAM for you.
One of the annoying things in OS terminal is that if you want to traverse word by word in a line of text you need to type ‘esc-b’ and ‘exc-f’. This post on macromates, the textmate blog, explains how to reset these keys. Enjoy.