Sorting BED files with headers

I’m trying to keep most of my genome stats in BED format, which makes it easy to mix and match analyses. However, one task that comes up over and over is how to sort these files when they have a header line. So that I don’t have to remember it, and you don’t have to spend any more time googling, the easy ‘linux’ way to do this is the following one-liner:

cat file.wheader.bed \
| { head -n 1; tail -n +2 \
| sort -k1,1 -k2,2g; } \
> file.wheader.sorted.bed
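If you’d rather stay in Python, the same idea (peel off the header, sort the rest) is just a few lines. This is my own generic sketch, not part of the original post; it assumes a tab-delimited BED with the header on the first line:

```python
def sort_bed_with_header(lines):
    """Sort BED lines by chromosome then numeric start, keeping the header first."""
    header, body = lines[0], lines[1:]
    body.sort(key=lambda l: (l.split("\t")[0], int(l.split("\t")[1])))
    return [header] + body
```

This mirrors the shell pipeline above: the header stays put while the remaining lines get a chromosome-then-start sort.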


Getting started with Ultra Conserved Elements

From Faircloth et al 2012.

If you attended Evolution 2013, you probably heard quite a lot of chatter about ultra conserved elements. Essentially, ultra conserved elements (UCEs) are parts of the genome that are highly conserved between different species. Although UCEs carry little phylogenetic information, they are surrounded by increasingly variable flanking sequence (see figure). When combined with their flanking sequence these ‘UCE loci’ make ideal markers to study evolutionary relationships across variable time scales. For example, we have used UCEs identified in birds and reptiles to identify homologous UCE loci in amphibians, birds and reptiles. We have also identified these same UCEs in many published mammal genomes.

An astute reader may wonder how it is possible to obtain UCE loci from taxa without sequenced genomes. The trick, it turns out, lies in the highly conserved UCE sequence. Because of their extreme conservation, it is possible to design complementary biotinylated probes to UCE sequences. These probes can then be used to enrich genomic DNA with a modified sequence capture approach that does not overly shear the genomic DNA thereby preserving the informative flanking sequence. The enriched DNA is then sequenced on a next-generation sequencer (e.g., an Illumina HiSeq) and the ultra-conserved regions are de novo assembled. With current protocols, this can yield close to 5000 unique UCE loci in most amniotes. These loci can then be used to resolve phylogenies from non-model taxa at an unprecedented genomic resolution.

Travis Glenn, Brant Faircloth, and I have been working on this method since 2007, first doing the alignments and identifying the UCE loci. We brought in Kevin Winker, Patty Gowaty, then Rob Brumfield and John McCormack because of our shared interests in avian phylogenomics as well as their additional systematics chops. My primary role has been to develop the computational pipeline for inferring individual gene trees from each UCE locus and converting these into a species tree. However, I’m now a postdoctoral fellow at the California Academy of Sciences, and in collaboration with Brian Simison and Anna Sellas we’re starting to sequence UCEs from a number of taxa. This means I need to run the whole protocol from start to finish, so I thought I might write a few blog posts to serve as a guide for other lab groups interested in sequencing and analyzing UCE loci.

We just received our first batch of probes so my next post will be about ordering the necessary materials and tools.

Designing qPCR primers from just a GTF/GFF file and a genome sequence

I recently had to design qPCR primers for some genes. I had a genome and an annotated GTF file derived from Cufflinks. Since I wanted the primers to span introns, to prevent amplification of genomic DNA, I needed both a fasta file of coding sequence to use as input to Primer3 and some associated information about where the introns were spliced out, so I could ensure the primers I designed spanned them. So I used the cufflinks gffread command to convert the GTF file into a set of fasta transcripts annotated with the positions of each exon.

Here’s how to do it yourself. If you don’t have cufflinks installed you can download precompiled binaries from here, or if you’re using OS X you can use homebrew + the science tap to install it.

Once you have cufflinks installed, type gffread -h to make sure everything is copacetic. Assuming that prints a bunch of documentation to the terminal, you can then run:

grep 'GeneID' my.gtf | gffread -g my.fasta -W -x GeneID.fasta

You’ll need to know how your gene or transcript is labeled in the GTF file (e.g. ‘GeneID’). The ‘-x’ flag ensures that the results only include coding sequence, the ‘-W’ adds the exon coordinates to the fasta header. Then run head GeneID.fasta to check that sequences were added to the output file.

One could probably automate the next steps further, but basically you just need to copy the first couple of exons/segs from the file and add brackets around the exon splice site you want the primers to span. You can use the ‘segs’ coordinates in the fasta header to determine where the splice junctions are.

>ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518
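That copy-and-bracket step can itself be scripted. The sketch below is my own illustration (the function name and padding are hypothetical, not from the original post): it parses the segs: coordinates out of a header like the one above and drops a Primer3 [ ] target around a chosen splice junction:

```python
def bracket_splice_site(seq, segs, junction=0, pad=5):
    """Wrap a Primer3 target ([...]) around splice junction number
    `junction` (0 = the junction between the first and second exon).
    `segs` is the comma-separated exon coordinate string from the
    gffread fasta header, e.g. "1-333,334-490,491-518" (1-based).
    `pad` is how many bases to include on each side of the junction.
    """
    exon_ends = [int(rng.split("-")[1]) for rng in segs.split(",")]
    boundary = exon_ends[junction]          # 1-based last base of the exon
    left, right = boundary - pad, boundary + pad
    return seq[:left] + "[" + seq[left:right] + "]" + seq[right:]
```

For the header above, junction=0 would place the target across the 333/334 splice site, which is exactly the region you want Primer3’s product to span.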

Lastly, use the modified fasta as input into Primer3. You’ll need to adjust the PCR product size to range from 70-200 base-pairs, but the default TM of 60 should be fine.
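If you use the command-line primer3_core rather than the web interface, the same settings can be expressed as a Boulder-IO record. This is only a sketch: the sequence, ID, and target values below are placeholders, and you should check the tag names against your Primer3 version’s manual.

```
SEQUENCE_ID=ENS0123412
SEQUENCE_TEMPLATE=ACGT...full transcript sequence here...
SEQUENCE_TARGET=328,13
PRIMER_PRODUCT_SIZE_RANGE=70-200
PRIMER_OPT_TM=60.0
=
```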

The final result should look something like this:

PRIMER PICKING RESULTS FOR ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518

No mispriming library specified
Using 1-based sequence positions
OLIGO            start  len      tm     gc%   any    3' seq 
LEFT PRIMER        275   20   59.83   50.00  8.00  0.00 TGAATGAGGCCTCTGTGTTG
RIGHT PRIMER       450   20   60.03   55.00  5.00  3.00 TAGATGTGAGGGGGCATCTC

TARGETS (start, len)*: 328,13

Filtering contigs/chromosomes from a multi-fasta file

A colleague needed to remove some individual fastas from a multi-fasta file. Googling didn’t reveal a canned way to do it so I hacked up this script.

8.29.12 – As Jason Gallant pointed out, if your fasta file is very small you don’t need to index it. Just use the simple Biopython code he mentions in the comments.
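For the record, the core of such a script is small. Here is a stand-alone sketch of the idea (my own, not the original script): it streams a multi-fasta file and drops any record whose ID is in an exclusion set, so no index is required:

```python
def filter_fasta(handle, exclude):
    """Yield fasta lines, skipping any record whose ID is in `exclude`.

    The ID is taken as the first whitespace-delimited token after '>'.
    """
    keep = False
    for line in handle:
        if line.startswith(">"):
            name = line[1:].split()[0]
            keep = name not in exclude
        if keep:
            yield line
```

Usage would be something like writing `filter_fasta(open("in.fa"), {"chrM"})` out to a new file.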

Python: Adding Read Group (@RG) tags to BAM or SAM files

The SAM specification defines @RG read-group tags, and downstream tools now expect them in all SAM/BAM alignments. If you are using GATK, you have probably noticed that it will not run without them. Since @RG tags weren’t standard until recently, I’ve written a script to add them in post hoc. You’ll need to install pysam and Python 2.7 to get it to work.
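The script itself uses pysam, but the idea is easy to see on plain SAM text. Here is a simplified, stdlib-only sketch of my own (the function name and tag values are illustrative; real BAMs should go through pysam): it appends an @RG line to the header and tags every alignment with RG:Z:

```python
def add_read_group(sam_lines, rg_id, sample):
    """Insert an @RG header line and tag each alignment with RG:Z:<rg_id>."""
    inserted = False
    for line in sam_lines:
        if line.startswith("@"):
            # Pass existing header lines through untouched.
            yield line
            continue
        if not inserted:
            # Emit the @RG line once, after the header and before alignments.
            yield "@RG\tID:{0}\tSM:{1}\n".format(rg_id, sample)
            inserted = True
        yield line.rstrip("\n") + "\tRG:Z:{0}\n".format(rg_id)
```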

Python: Multiprocessing large files

I’ve been working with a lot of very large files, and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce-style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about, I came up with the following code.

The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines is then written to an output file, or in this example simply printed to the screen. Normally this would be tricky code to write, but Python 2.7’s wonderful multiprocessing module handles all the synchronization for you.
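Here is a minimal Python 3 sketch of that chunk-and-map pattern (the post targets 2.7, but the idea is identical; the names are mine, not from the original script):

```python
from itertools import islice
from multiprocessing import Pool

def shout(line):
    # Stand-in for real per-line work (parsing, filtering, etc.).
    return line.upper()

def process_in_chunks(lines, func, chunk_size=10000, workers=4):
    """Read `lines` a chunk at a time and farm each chunk out to a pool."""
    results = []
    it = iter(lines)
    with Pool(workers) as pool:
        while True:
            chunk = list(islice(it, chunk_size))
            if not chunk:
                break
            # map() blocks until the whole chunk is done, so output
            # stays in input order with no manual synchronization.
            results.extend(pool.map(func, chunk))
    return results
```

Reading in chunks keeps memory bounded no matter how large the input file is, while pool.map keeps every core busy within each chunk.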

Python: Interleave Paired-End Reads

Here’s a simple script for interleaving paired-end fastq files. You’ll need to do this if you want to create input files for velvet. It requires Python 2.7.
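The gist of the script, as a Python 3 sketch of my own (the original targets 2.7; the function names here are illustrative), is to step through the two files four lines at a time and write the records out alternately:

```python
def fastq_records(handle):
    """Yield one 4-line fastq record at a time."""
    while True:
        record = [handle.readline() for _ in range(4)]
        if not record[0]:
            return
        yield record

def interleave(read1, read2, out):
    """Write records from two paired fastq handles alternately to `out`."""
    for rec1, rec2 in zip(fastq_records(read1), fastq_records(read2)):
        out.writelines(rec1)
        out.writelines(rec2)
```

This assumes the two files list mates in the same order, which is how paired-end fastq files normally come off the sequencer.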

Bowtie2 output as BAM

Bowtie2 is a short-read aligner optimized for reads of 50 bp or longer. I’ve been playing around with it and was initially puzzled that it only outputs SAM-formatted alignments. Then I realized you can pipe the output straight into samtools, which will do the compression to BAM for you.

$ bowtie2 \
-p 4 \
-x /genome/index \
-1 pair1.fastq \
-2 pair2.fastq \
-U unpaired.fastq \
--very-sensitive \
-X 1000 \
-I 200 \
| samtools view -bS - > output.bam