I recently had to design qPCR primers for some genes. I had a genome and an annotated GTF file derived from Cufflinks. Since I wanted the primers to span introns, to prevent the amplification of genomic DNA, I needed a both fasta file of coding sequence to use as input in to Primer3 as well as some associated information about where the introns were spliced out so I could ensure the primers I design spanned introns. So, what I did was use the cufflinks gffread command to convert the GTF file to create set of fasta transcripts annotated with the positions of each of the exons.

Here’s how to do it yourself. If you don’t have cufflinks installed you can download precompiled binaries from here, or if you’re using OS X you can use homebrew + the science tap to install it.

Once you have cufflinks installed type gffread –h to make sure everything is copacetic. Assuming that prints a bunch documentation to the terminal you can then run:

grep 'GeneID' my.gtf | gffread –g my.fasta -W -x GeneID.fasta

You’ll need to know how your gene or transcript is labeled in the GTF file (e.g. ‘GeneID’). The ‘-x’ flag ensures that the results only include coding sequence, the ‘-W’ adds the exon coordinates to the fasta header. Then run head GeneID.fasta to check that sequences were added to the output file.

One could probably further automate the next steps, but basically you just need to copy the first couple of exons/segs from the file and add brackets around the exon splice site you want the primers to span. You can use the ‘segs’ in the fasta header to determine where the splice junctions are.

>ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518
ATGGCTCAGAGAGATGCTGACAAATACCTCTATGTGGATAGAAATCTCATCAACAACCCTCTTGCTCAGG
CCGATTGGGCAGCTAAGAAACTGGTGTGGGTCCCATCAGAAAAGAATGGCTTTGAGCCTGCTAGCTTAAA
AGAGGAAGTAGGAGATGAAGCCATTGTGGAGCTTGCAGAGAACGGGAAGAAAGTGCGAGTAAACAAAGAT
GATATCCAAAAGATGAACCCGCCTAAGTTCTCTAAAGTGGAAGACATGGCTGAATTGACCTGCCTGAATG
AGGCCTCTGTGTTGCACAACTTAAAGGAACGATACTACTCGGGGCTT[ATCTATACCTACT]CAGGCCTA
CTGTGTGGTCATAAATCCCTACAAGAACTTGCCCATCTACTCAGAAGAGATTGTGGAAATGTATAAGGGC
AAAAAGAGACACGAGATGCCCCCTCACATCTATGCCATTACAGACACAGCCTACAGGAGTATGATGCAAG

Lastly, use the modified fasta as input into Primer3. You’ll need to adjust the PCR product size to range from 70-200 base-pairs, but the default TM of 60 should be fine.

The final result should look something like this:

PRIMER PICKING RESULTS FOR ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518

No mispriming library specified
Using 1-based sequence positions
OLIGO            start  len      tm     gc%   any    3' seq 
LEFT PRIMER        275   20   59.83   50.00  8.00  0.00 TGAATGAGGCCTCTGTGTTG
RIGHT PRIMER       450   20   60.03   55.00  5.00  3.00 TAGATGTGAGGGGGCATCTC
SEQUENCE SIZE: 488
INCLUDED REGION SIZE: 488

PRODUCT SIZE: 176, PAIR ANY COMPL: 4.00, PAIR 3' COMPL: 2.00
TARGETS (start, len)*: 328,13

    1 ATGGCTCAGAGAGATGCTGACAAATACCTCTATGTGGATAGAAATCTCATCAACAACCCT
                                                                  

   61 CTTGCTCAGGCCGATTGGGCAGCTAAGAAACTGGTGTGGGTCCCATCAGAAAAGAATGGC
                                                                  

  121 TTTGAGCCTGCTAGCTTAAAAGAGGAAGTAGGAGATGAAGCCATTGTGGAGCTTGCAGAG
                                                                  

  181 AACGGGAAGAAAGTGCGAGTAAACAAAGATGATATCCAAAAGATGAACCCGCCTAAGTTC
                                                                  

  241 TCTAAAGTGGAAGACATGGCTGAATTGACCTGCCTGAATGAGGCCTCTGTGTTGCACAAC
                                        >>>>>>>>>>>>>>>>>>>>      

  301 TTAAAGGAACGATACTACTCGGGGCTTATCTATACCTACTCAGGCCTACTGTGTGGTCAT
                                 *************                    

  361 AAATCCCTACAAGAACTTGCCCATCTACTCAGAAGAGATTGTGGAAATGTATAAGGGCAA
                                                                  

  421 AAAGAGACACGAGATGCCCCCTCACATCTATGCCATTACAGACACAGCCTACAGGAGTAT
                <<<<<<<<<<<<<<<<<<<<                              

  481 GATGCAAG

A colleague needed to remove some individual fastas from a multi-fasta file. Googling didn’t reveal a canned way to do it so I hacked up this script.

8.29.12 – As Jason Gallant pointed out, if your fasta is very small you don’t need to index your fasta file. Just  use the simple biopython code he mentions in the comments.

The SAM specification now requires @RG tags to be included in all SAM/BAM alignments. If you are using GATK you have probably noticed that it will not run without them. Since @RG tags weren’t standard until recently, I’ve written a script to add them  in post hoc. You’ll need to install pysam and python2.7 to get it to work.

I been working with a lot of very large files and it has become increasing obvious that using a single processor core  is a major bottleneck to getting my data processed in a timely fashion. A MapReduce style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.

The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently.  The new set of lines are then written to an output file or in this example just printed to the screen. Normally this would be tricky code to write, but python 2.7′s wonderful multiprocessing module handles all the synchronization for you.

Here’s a simple script for interleaving paired-end fastq files. You’ll need to do this if you want to create input files for velvet. Unlike the velvet’s shuffleSequences_fastq.pl perl script, this script handles gzipped input and output. It requires python 2.7.

Bowtie2 is a short read aligner that is optimized for aligning longer reads of lengths of 50 bp or greater. I’ve been playing around with it and was initially puzzled by the fact that it only outputs SAM formated alignments. Then I realized you can pipe the output straight into samtools which will do the compression to BAM for you.

$ bowtie2 \
-p 4 \
-x /genome/index \
-1 pair2.fastq \
-2 pair2.fastq \
-U unpaired.fastq \
--very-sensitive \
-X 1000 \
-I 200 \
| samtools view -bS - > output.bam

To install ABySS on an system running an older version of gcc and use the following commands.

>./configure –enable-maxk=96 –disable-openmp \

CPPFLAGS=-I<path to google-sparsehash install>/include \

–prefix=<ABySS install directory>/

>make AM_CXXFLAGS=-Wall

>make install

One of the annoying things in OS terminal is that if you want to traverse word by word in a line of text you need to type ‘esc-b’ and ‘exc-f’.  This post on macromates, the textmate blog, explains how to reset these keys. Enjoy.

Since I started using github in a serious way back in January I’ve begun writing my documentation in the  markdown format that displays so nicely on github. Markdown is essentially a parsing tool and a simple text syntax that allows the easy conversion of human ‘readable text’ to html. It’s intuitive, it took less than 5 minutes to pick up, and saves me a ton of time not writing HTML. However, its ease of use is tempered, a bit, by a lack of features.  Although it is easy to create headers, lists, and code bocks – simple HTML stuff – it doesn’t include the option to create tables, formated mathematical formulas, citations and bibliographies.  Since I’m a scientist who wants to produce documents with these sorts of features, this is annoying.

Luckily, the markdown syntax has recently been extended, in a project called MultiMarkdown, to include many of the aforementioned features. Multimarkdown essentially merges the markdown syntax with LaTeX which, if you haven’t heard of it, is a rather inscrutable, but extremely powerful text formatting language.  It’s popular in the CS and physics disciplines.  LaTeX produces beautiful documents, but it’s easy to spend a week or more adjusting the formatting and reading the API trying to figure out some of the more complicated features.  Multimarkdown looks like it will do much of the more basic LaTeX formatting, but without the headache.

Welcome to Vertebrate Zoology lab (Bi302 Lab).  I’ll be using this portion of my website to post notes/slides and to answer your questions.  Please make use of the comments. More to come.