<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nick Crawford</title>
	<atom:link href="http://www.ngcrawford.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ngcrawford.com</link>
	<description>Evolution and more...</description>
	<lastBuildDate>Thu, 25 Apr 2013 15:22:46 +0000</lastBuildDate>
	<language>en-US</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.5.1</generator>
		<item>
		<title>Designing qPCR primers from just a GTF/GFF file and a genome sequence</title>
		<link>http://www.ngcrawford.com/2013/04/03/designing-qpcr-primers-from-at-gtfgff-file-and-a-genome/</link>
		<comments>http://www.ngcrawford.com/2013/04/03/designing-qpcr-primers-from-at-gtfgff-file-and-a-genome/#comments</comments>
		<pubDate>Wed, 03 Apr 2013 14:07:38 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=641</guid>
		<description><![CDATA[I recently had to design qPCR primers for some genes. I had a genome and an annotated GTF file derived from Cufflinks. Since I wanted the primers to span introns, to prevent the amplification of genomic DNA, I needed both a fasta file of coding sequence to use as input to Primer3 as well [...]]]></description>
				<content:encoded><![CDATA[<p>I recently had to design qPCR primers for some genes. I had a genome and an annotated <a href="http://mblab.wustl.edu/GTF22.html" target="_blank">GTF</a> file derived from Cufflinks. Since I wanted the primers to span introns, to prevent the amplification of genomic DNA, I needed both a fasta file of coding sequence to use as input to <a href="http://frodo.wi.mit.edu/" target="_blank">Primer3</a> and some associated information about where the introns were spliced out, so I could ensure the primers I designed spanned introns. So I used the cufflinks gffread command to convert the GTF file into a set of fasta transcripts annotated with the positions of each of the exons.</p>
<p>Here&#8217;s how to do it yourself. If you don’t have cufflinks installed you can download precompiled binaries from <a href="http://cufflinks.cbcb.umd.edu/" target="_blank">here</a>, or if you’re using OS X you can use <a href="http://mxcl.github.com/homebrew/" target="_blank">homebrew</a> + the <a href="https://github.com/Homebrew/homebrew-science" target="_blank">science tap</a> to install it.</p>
<p>Once you have cufflinks installed, type <code>gffread -h</code> to make sure everything is copacetic. Assuming that prints a bunch of documentation to the terminal, you can then run:</p>
<p><code>grep 'GeneID' my.gtf | gffread -g my.fasta -W -x GeneID.fasta</code></p>
<p>You’ll need to know how your gene or transcript is labeled in the GTF file (e.g. ‘GeneID’). The ‘-x’ flag ensures that the results only include coding sequence, and the &#8216;-W&#8217; flag adds the exon coordinates to the fasta header. Then run <code>head GeneID.fasta</code> to check that sequences were added to the output file.</p>
<p>One could probably further automate the next steps, but basically you just need to copy the first couple of exons/segs from the file and add brackets around the exon splice site you want the primers to span. You can use the &#8216;segs&#8217; in the fasta header to determine where the splice junctions are.</p>
<pre><code>&gt;ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518
ATGGCTCAGAGAGATGCTGACAAATACCTCTATGTGGATAGAAATCTCATCAACAACCCTCTTGCTCAGG
CCGATTGGGCAGCTAAGAAACTGGTGTGGGTCCCATCAGAAAAGAATGGCTTTGAGCCTGCTAGCTTAAA
AGAGGAAGTAGGAGATGAAGCCATTGTGGAGCTTGCAGAGAACGGGAAGAAAGTGCGAGTAAACAAAGAT
GATATCCAAAAGATGAACCCGCCTAAGTTCTCTAAAGTGGAAGACATGGCTGAATTGACCTGCCTGAATG
AGGCCTCTGTGTTGCACAACTTAAAGGAACGATACTACTCGGGGCTT<strong>[<em>ATCTATACCTACT</em>]</strong>CAGGCCTA
CTGTGTGGTCATAAATCCCTACAAGAACTTGCCCATCTACTCAGAAGAGATTGTGGAAATGTATAAGGGC
AAAAAGAGACACGAGATGCCCCCTCACATCTATGCCATTACAGACACAGCCTACAGGAGTATGATGCAAG</code></pre>
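<p>The bracket-adding step could probably be scripted along the following lines. This is only a sketch in Python 3, not part of the original workflow: the function name <code>bracket_junction</code> and the <code>flank</code> default of 6 bases are assumptions made for illustration, and it expects a header containing a <code>segs:</code> field like the one gffread writes with -W.</p>

```python
import re


def bracket_junction(header, sequence, junction=1, flank=6):
    """Return `sequence` with [ ] around the bases flanking one splice junction.

    `junction` is 1-based: 1 is the boundary between the first and second
    segment listed in the header's segs: field. `flank` bases on each side
    of the boundary are enclosed (an arbitrary illustrative default).
    """
    match = re.search(r"segs:([\d,\-]+)", header)
    if match is None:
        raise ValueError("no segs: field found in header")
    # Each segment is 'start-end' in 1-based transcript coordinates.
    segs = [tuple(int(x) for x in seg.split("-"))
            for seg in match.group(1).split(",")]
    if not 1 <= junction < len(segs):
        raise ValueError("junction out of range")
    boundary = segs[junction - 1][1]   # last base of the upstream exon (1-based)
    left = max(boundary - flank, 0)    # 0-based slice start
    right = min(boundary + flank, len(sequence))
    return sequence[:left] + "[" + sequence[left:right] + "]" + sequence[right:]
```

<p>Pass the fasta header line and the sequence joined into a single string (no line breaks), then paste the result into Primer3 as before.</p>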
<p>Lastly, use the modified fasta as input into <a href="http://frodo.wi.mit.edu/" target="_blank">Primer3</a>. You&#8217;ll need to adjust the PCR product size to range from 70-200 base-pairs, but the default Tm of 60 &#176;C should be fine.</p>
<p>The final result should look something like this:</p>
<pre><code>PRIMER PICKING RESULTS FOR ENS0123412 gene=bactin loc:3(-)22053-225439 segs:1-333,334-490,491-518

No mispriming library specified
Using 1-based sequence positions
OLIGO            start  len      tm     gc%   any    3' seq 
LEFT PRIMER        275   20   59.83   50.00  8.00  0.00 TGAATGAGGCCTCTGTGTTG
RIGHT PRIMER       450   20   60.03   55.00  5.00  3.00 TAGATGTGAGGGGGCATCTC
SEQUENCE SIZE: 488
INCLUDED REGION SIZE: 488

PRODUCT SIZE: 176, PAIR ANY COMPL: 4.00, PAIR 3' COMPL: 2.00
TARGETS (start, len)*: 328,13

    1 ATGGCTCAGAGAGATGCTGACAAATACCTCTATGTGGATAGAAATCTCATCAACAACCCT
                                                                  

   61 CTTGCTCAGGCCGATTGGGCAGCTAAGAAACTGGTGTGGGTCCCATCAGAAAAGAATGGC
                                                                  

  121 TTTGAGCCTGCTAGCTTAAAAGAGGAAGTAGGAGATGAAGCCATTGTGGAGCTTGCAGAG
                                                                  

  181 AACGGGAAGAAAGTGCGAGTAAACAAAGATGATATCCAAAAGATGAACCCGCCTAAGTTC
                                                                  

  241 TCTAAAGTGGAAGACATGGCTGAATTGACCTGCCTGAATGAGGCCTCTGTGTTGCACAAC
                                        >>>>>>>>>>>>>>>>>>>>      

  301 TTAAAGGAACGATACTACTCGGGGCTTATCTATACCTACTCAGGCCTACTGTGTGGTCAT
                                 *************                    

  361 AAATCCCTACAAGAACTTGCCCATCTACTCAGAAGAGATTGTGGAAATGTATAAGGGCAA
                                                                  

  421 AAAGAGACACGAGATGCCCCCTCACATCTATGCCATTACAGACACAGCCTACAGGAGTAT
                <<<<<<<<<<<<<<<<<<<<                              

  481 GATGCAAG</code></pre>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2013/04/03/designing-qpcr-primers-from-at-gtfgff-file-and-a-genome/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Filtering contigs/chromosomes from a multi-fasta file</title>
		<link>http://www.ngcrawford.com/2012/07/31/filtering-contigschromosomes-from-a-multi-fasta-file/</link>
		<comments>http://www.ngcrawford.com/2012/07/31/filtering-contigschromosomes-from-a-multi-fasta-file/#comments</comments>
		<pubDate>Tue, 31 Jul 2012 13:46:47 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[fasta]]></category>
		<category><![CDATA[pysam]]></category>
		<category><![CDATA[python]]></category>
		<category><![CDATA[samtools]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=544</guid>
		<description><![CDATA[A colleague needed to remove some individual fastas from a multi-fasta file. Googling didn&#8217;t reveal a canned way to do it so I hacked up this script. 8.29.12 &#8211; As Jason Gallant pointed out, if your fasta is very small you don&#8217;t need to index your fasta file. Just  use the simple biopython code he mentions in [...]]]></description>
				<content:encoded><![CDATA[<p>A colleague needed to remove some individual fastas from a multi-fasta file. Googling didn&#8217;t reveal a canned way to do it so I hacked up this script.</p>
<p>8.29.12 &#8211; As Jason Gallant pointed out, if your fasta is very small you don&#8217;t need to index your fasta file. Just  use the simple biopython code he mentions in the comments.</p>
<script src="https://gist.github.com/3217022.js"></script><noscript><pre><code class="language-python python">#!/usr/bin/env python
# encoding: utf-8

&quot;&quot;&quot;
removeFaFromFasta.py

Created by Nick Crawford on 2012-31-07.

The author may be contacted at ngcrawford@gmail.com

Requires samtools and pysam to be installed.
http://samtools.sourceforge.net/
http://code.google.com/p/pysam/
&quot;&quot;&quot;

import os
import pysam
import argparse

def get_args():
  &quot;&quot;&quot;Parse sys.argv&quot;&quot;&quot;
  parser = argparse.ArgumentParser()

  parser.add_argument('-fa','--fasta', required=True,
            help='Path to fasta file to filter.')

  parser.add_argument('-o','--output', required=True,
            help='Path to filtered fasta file.')

  parser.add_argument('-l', '--filter-list', required=True, nargs='+',
            help='List of chrm/contig names to remove.')

  args = parser.parse_args()
  return args


def splitIterator(text, size):
  &quot;Iterator that splits string into list of substrings.&quot;
  assert size &gt; 0, &quot;size should be &gt; 0&quot;
  for start in xrange(0, len(text), size):
    yield text[start:start + size]

def main(args):
  # Setup files
  fasta = args.fasta
  filtered_fa = args.output
  faidx = args.fasta + '.fai'

  # Make sure .fai exists; os.path.exists() returns False rather than raising
  if not os.path.exists(faidx):
    raise SystemExit(&quot;You need to index the fasta file with samtools.\n {} does not exist.&quot;.format(faidx))

  # Filter names for contigs
  bad_names = args.filter_list
  chrm_names = (line.strip().split()[0] for line in open(faidx,'rU'))
  filtered_chrm_names = (cn for cn in chrm_names if cn not in bad_names)

  # Write contigs/chrms to output fasta
  fasta = pysam.Fastafile(fasta)
  filtered_fa = open(filtered_fa,'w')
  
  for name in filtered_chrm_names:
  
    print &quot;Processed:&quot;, name
    chrm = fasta.fetch(name)
    filtered_fa.write(&quot;&gt;&quot; + name + &quot;\n&quot;)
  
    # split lines on 80 characters
    for chars in splitIterator(chrm, 80):
      filtered_fa.write(chars + &quot;\n&quot;)

  # Clean up open files
  filtered_fa.close() 

if __name__ == '__main__':
  args = get_args()
  main(args)
</code></pre></noscript>
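<p>For a small fasta, the index-free approach could look something like the sketch below. It is written in plain Python 3 with no third-party modules and is not the biopython code from the comments; the function name <code>filter_fasta</code> is made up for this example.</p>

```python
def filter_fasta(lines, bad_names):
    """Yield fasta lines, skipping records whose name is in bad_names.

    The record name is taken to be the first whitespace-delimited word
    after '>' on the header line.
    """
    bad_names = set(bad_names)
    keep = True
    for line in lines:
        if line.startswith(">"):
            fields = line[1:].split()
            name = fields[0] if fields else ""
            keep = name not in bad_names
        if keep:
            yield line
```

<p>Because it streams line by line, you can hand it an open file and write the result with <code>outfile.writelines(filter_fasta(infile, ['chrM']))</code>. Unlike the script above, it keeps the original line wrapping and header descriptions.</p>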
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/07/31/filtering-contigschromosomes-from-a-multi-fasta-file/feed/</wfw:commentRss>
		<slash:comments>2</slash:comments>
		</item>
		<item>
		<title>Python: Adding Read Group (@RG) tags to BAM or SAM files</title>
		<link>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/</link>
		<comments>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 16:44:11 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[@RG]]></category>
		<category><![CDATA[BAM]]></category>
		<category><![CDATA[GATK]]></category>
		<category><![CDATA[pysam]]></category>
		<category><![CDATA[SAM]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=506</guid>
		<description><![CDATA[GATK will not run on SAM/BAM alignments that lack @RG (read group) tags, although the SAM specification itself treats them as optional. Since @RG tags weren&#8217;t standard until recently, I&#8217;ve written a script to add them in post hoc. You&#8217;ll need to install pysam and python2.7 to get [...]]]></description>
				<content:encoded><![CDATA[<p>If you are using <a title="GATK" href="http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page" target="_blank">GATK</a> you have probably noticed that it will not run on SAM/BAM alignments that lack @RG (read group) tags, although the <a title="Link to PDF" href="http://samtools.sourceforge.net/SAM1.pdf" target="_blank">SAM specification</a> itself treats them as optional. Since @RG tags weren&#8217;t standard until recently, I&#8217;ve written a script to add them in post hoc. You&#8217;ll need to install <a title="pysam" href="http://code.google.com/p/pysam/" target="_blank">pysam</a> and <a title="Python2.7" href="http://www.python.org/getit/releases/2.7/" target="_blank">python2.7</a> to get it to work.</p>
<script src="https://gist.github.com/2407317.js"></script><noscript><pre><code class="language-python python">import os
import sys
import glob
import pysam
import argparse
import multiprocessing

def get_args():
    '''Parse sys.argv'''
    parser = argparse.ArgumentParser()
    parser.add_argument('--cores', type=int,
                        help='the number of available processor cores')
    parser.add_argument('-i','--input-dir', required=True,
                        help='The input directory containing the bam files.')
    parser.add_argument('-CN',type=str,
                        help=&quot;Name of sequencing center producing the read. \
                        GATK not required.&quot;)
    parser.add_argument('-DS',type=str,
                        help=&quot;Description. GATK Not Required. &quot;)
    parser.add_argument('-DT',type=str, required=True,
                        help=&quot;Date the run was produced (ISO8601 date or date/time). \
                        GATK Not Required. &quot;)
    parser.add_argument('-PI',type=int, required=True,
                        help=&quot;Predicted median insert size. GATK Not Required.&quot;)
    parser.add_argument('-PL',type=str, required=True,
                        choices = ['CAPILLARY', 'LS454', 'ILLUMINA', 'SOLID', 
                                    'HELICOS', 'IONTORRENT', 'PACBIO'],
                        help=&quot;Platform/technology used to produce the reads.&quot;)

    args = parser.parse_args()
    return args

def addRG2Header(filename, sam_info, args):
    &quot;&quot;&quot;Add read group info to a header.&quot;&quot;&quot;
    # CREATE TEMPLATE
    # Read group. Unordered multiple @RG lines are allowed.
    RG_template = { 'ID': '',           # Read group identifier. e.g., Illumina flowcell + lane name and number
                    'CN': '',           # GATK Not Required. Name of sequencing center producing the read.
                    'DS': '',           # GATK Not Required. Description
                    'DT': '',           # GATK Not Required. Date the run was produced (ISO8601 date YYYY-MM-DD or YYYYMMDD)
                    'PI': '',           # GATK Not Required. Predicted median insert size.
                    'PU': '',           # GATK Not Required. Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD).
                    'SM': '',           # Sample. Use pool name where a pool is being sequenced.
                    'PL': 'ILLUMINA'}   # Platform/technology used to produce the reads.

    samfile = pysam.Samfile(filename, 'r')
    new_header = samfile.header.copy()
    samfile.close()

    # ADD INFO TO TEMPLATE
    RG_template = RG_template.copy()
    RG_template['ID'] = sam_info[&quot;sample_name&quot;]
    if args.CN: RG_template['CN'] = args.CN.upper()
    # Default description is sample.locality; an explicit -DS overrides it.
    RG_template['DS'] = &quot;{0}.{1}&quot;.format(sam_info['sample_name'], sam_info['locality'])
    if args.DS: RG_template['DS'] = args.DS
    RG_template['DT'] = args.DT
    RG_template['LB'] = sam_info[&quot;sample_name&quot;]
    RG_template['SM'] = sam_info[&quot;sample_name&quot;]
    RG_template['PI'] = args.PI
    RG_template['PU'] = '{0}.{1}'.format(sam_info['flowcell_id'], sam_info['lane'])
    new_header['RG'] = [RG_template]
    return new_header


def add_RGs_2_BAMs_runner(data):
    &quot;&quot;&quot;Generates the correct @RG header and adds a RG field to a bam file.&quot;&quot;&quot;
    # Make Bam of Sam if it doesn't exist
    filename, new_RG_header = data
    if filename.endswith('sam'):
        sam_handle = pysam.Samfile(filename)
        bam_name = os.path.splitext(filename)[0]+&quot;.bam&quot;
        bam_handle = pysam.Samfile( bam_name, &quot;wb&quot;, template = sam_handle )
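        for s in sam_handle:
            bam_handle.write(s)
        # Close both handles so the BAM is fully written before sorting
        bam_handle.close()
        sam_handle.close()
        filename = bam_name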

    # Massage paths and make outputfiles
    pysam.sort(filename, os.path.splitext(filename)[0]+&quot;.sorted&quot;)
    pysam.index(os.path.splitext(filename)[0]+&quot;.sorted.bam&quot;)
    filename = os.path.splitext(filename)[0]+&quot;.sorted.bam&quot;
    path, filename = os.path.split(filename)
    name, ext = os.path.splitext(filename)
    new_name = name + '.wRG.' + 'bam'
    outfile_name =  os.path.join(path,new_name)
    outfile = pysam.Samfile( outfile_name, 'wb', header = new_RG_header )

    # Step 2: Process Samfile adding Read Group to Each Read
    samfile = pysam.Samfile(os.path.join(path, filename))
    for count, read in enumerate(samfile.fetch()):
        name = read.qname
        read_group = os.path.split(filename)[1].split(&quot;.&quot;)[0]
        new_tags = read.tags
        new_tags.append(('RG', read_group))
        read.tags = new_tags
        outfile.write(read)
    outfile.close()

    # Step 3: Make index of read group enabled samfile
    pysam.index(outfile_name)
    sys.stdout.write(&quot;.&quot;)
    sys.stdout.flush()
    return

def parseFileName(filepath):
    path, filename = os.path.split(filepath)
    sample_name, locality, inline_tag, third_read_tag = filename.split(&quot;.&quot;)[:-1]
    sam_handle = pysam.Samfile(filepath,'r')
    sam_line = sam_handle.next()
    read_info = sam_line.qname
    sam_handle.close()
    instrument, run_id, flowcell_id, lane = read_info.split(&quot;:&quot;)[:4]
    info = {'sample_name': sample_name,
            'locality': locality,
            'inline_tag': inline_tag,
            'third_read_tag': third_read_tag,
            'instrument':instrument,  
            'run_id': run_id,
            'flowcell_id': flowcell_id,
            'lane': lane}
    return info

def add_RGs_2_BAMs(pool, args):
    data_for_map = []
    print 'Making RG headers.'
    for count, filename in enumerate(glob.glob(os.path.join(args.input_dir,'*'))):
        if filename.endswith('sam') or filename.endswith('bam'):
            sam_info = parseFileName(filename)
            new_RG_header = addRG2Header(filename, sam_info, args)
            data_for_map.append([filename, new_RG_header])

    sys.stdout.write(&quot;\nAdding RGs and making BAMs&quot;)
    sys.stdout.flush()
    pool.map(add_RGs_2_BAMs_runner, data_for_map)
    return

def main():
    args = get_args()
    cores = args.cores
    pool = multiprocessing.Pool(args.cores)
    add_RGs_2_BAMs(pool, args)

if __name__ == '__main__':
    main()
</code></pre></noscript>
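<p>To check the result, it helps to know what the read group dictionary should look like once it is written out as text in the header (e.g. as seen with <code>samtools view -H</code>). The Python 3 sketch below formats such a dictionary as a tab-separated @RG line. The function name <code>format_rg_line</code> is hypothetical, and putting ID first with the remaining tags sorted alphabetically is just a choice made for this example; pysam may order them differently.</p>

```python
def format_rg_line(read_group):
    """Format a read group dict as a SAM @RG header line.

    ID is mandatory and is written first; other tags follow in
    alphabetical order, and tags with empty values are left out.
    """
    if not read_group.get("ID"):
        raise ValueError("an @RG line needs a non-empty ID")
    fields = ["ID:{0}".format(read_group["ID"])]
    for tag in sorted(read_group):
        value = read_group[tag]
        if tag == "ID" or value in ("", None):
            continue
        fields.append("{0}:{1}".format(tag, value))
    return "\t".join(["@RG"] + fields)
```

<p>Each read in the output BAM should then carry an RG tag whose value matches the ID field on that line.</p>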
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/feed/</wfw:commentRss>
		<slash:comments>1</slash:comments>
		</item>
		<item>
		<title>Python: Multiprocessing large files</title>
		<link>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/</link>
		<comments>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 13:08:43 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[multiprocessing]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=466</guid>
		<description><![CDATA[I&#8217;ve been working with a lot of very large files and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce style algorithm seemed like the way to go, but I had a hard time finding a useful example. [...]]]></description>
				<content:encoded><![CDATA[<p>I&#8217;ve been working with a lot of very large files and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A <a href="http://en.wikipedia.org/wiki/MapReduce" target="_blank">MapReduce</a> style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.</p>
<p>The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines is then written to an output file or, in this example, just printed to the screen. Normally this would be tricky code to write, but python 2.7&#8217;s wonderful <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing module</a> handles all the synchronization for you.</p>
<script src="https://gist.github.com/2237170.js"></script><noscript><pre><code class="language-python python">#!/usr/bin/env python
# encoding: utf-8

import multiprocessing
from textwrap import dedent
from itertools import izip_longest

def process_chunk(d):
	&quot;&quot;&quot;Replace this with your own function
	that processes data one line at a
	time&quot;&quot;&quot;

	d = d.strip() + ' processed'
	return d 

def grouper(n, iterable, padvalue=None):
	&quot;&quot;&quot;grouper(3, 'abcdefg', 'x') --&gt;
	('a','b','c'), ('d','e','f'), ('g','x','x')&quot;&quot;&quot;

	return izip_longest(*[iter(iterable)]*n, fillvalue=padvalue)

if __name__ == '__main__':

	# test data
	test_data = &quot;&quot;&quot;\
	1 some test garbage
	2 some test garbage
	3 some test garbage
	4 some test garbage
	5 some test garbage
	6 some test garbage
	7 some test garbage
	8 some test garbage
	9 some test garbage
	10 some test garbage
	11 some test garbage
	12 some test garbage
	13 some test garbage
	14 some test garbage
	15 some test garbage
	16 some test garbage
	17 some test garbage
	18 some test garbage
	19 some test garbage
	20 some test garbage&quot;&quot;&quot;
	test_data = dedent(test_data)
	test_data = test_data.split(&quot;\n&quot;)

	# Create pool (p)
	p = multiprocessing.Pool(4)

	# Use 'grouper' to split test data into
	# groups you can process without using a
	# ton of RAM. You'll probably want to 
	# increase the chunk size considerably
	# to something like 1000 lines per core.

	# The idea is that you replace 'test_data'
	# with a file-handle
	# e.g., test_data = open('file.txt','rU')

	# And, you'd write to a file instead of
	# printing to stdout

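	for chunk in grouper(10, test_data):
		# grouper pads the last chunk with None; drop the padding before mapping
		lines = [line for line in chunk if line is not None]
		results = p.map(process_chunk, lines)
		for r in results:
			print r 	# replace with outfile.write()</code></pre></noscript>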
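<p>For reading from a real file, the same idea might be sketched as follows. This version assumes Python 3 (the script above is Python 2.7), and the names <code>chunks</code>, <code>process_line</code> and <code>process_file</code> are invented for the example. It uses <code>itertools.islice</code> instead of a padded grouper, so the last chunk is simply shorter and there is no padding to filter out.</p>

```python
import multiprocessing
from itertools import islice


def process_line(line):
    """Placeholder per-line work; swap in your own function."""
    return line.strip() + " processed"


def chunks(iterable, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_file(in_path, out_path, cores=4, chunk_size=1000):
    """Process in_path chunk by chunk, writing results in input order."""
    with multiprocessing.Pool(cores) as pool, \
            open(in_path) as infile, open(out_path, "w") as outfile:
        for chunk in chunks(infile, chunk_size):
            # Pool.map returns results in the same order as its input
            for result in pool.map(process_line, chunk):
                outfile.write(result + "\n")
```

<p>Call <code>process_file('input.txt', 'output.txt')</code> from inside an <code>if __name__ == '__main__':</code> block, since multiprocessing needs that guard on platforms that start workers by spawning a fresh interpreter. Only one chunk is held in memory at a time.</p>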
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python: Interleave Paired-End Reads</title>
		<link>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/</link>
		<comments>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 03:10:54 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[fastq]]></category>
		<category><![CDATA[next-gen sequencing]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=472</guid>
		<description><![CDATA[Here&#8217;s a simple script for interleaving paired-end fastq files. You&#8217;ll need to do this if you want to create input files for velvet. Unlike velvet&#8217;s shuffleSequences_fastq.pl perl script, this script handles gzipped input and output. It requires python 2.7.]]></description>
				<content:encoded><![CDATA[<p>Here&#8217;s a simple script for interleaving paired-end fastq files. You&#8217;ll need to do this if you want to create input files for <a title="velvet" href="http://www.ebi.ac.uk/~zerbino/velvet/">velvet</a>. Unlike velvet&#8217;s <em>shuffleSequences_fastq.pl</em> perl script, this script handles gzipped input and output. It requires <a title="Python 2.7" href="http://www.python.org/getit/releases/2.7/" target="_blank">python 2.7</a>.</p>
<script src="https://gist.github.com/2232505.js"></script><noscript><pre><code class="language-python python">#!/usr/bin/env python
# encoding: utf-8

import gzip
import argparse

def interface():
	args = argparse.ArgumentParser()
	args.add_argument('-l', '--left-input',
		help='The first input file in fastq format.')

	args.add_argument('-r', '--right-input', 
		help='The second input file in fastq format.')

	args.add_argument('-o', '--output', 
		help='The output file in fastq format.')
	
	args = args.parse_args()
	return args


def process_reads(args):
	
	if args.left_input.endswith('.gz') or args.right_input.endswith('.gz'):

		left = gzip.open(args.left_input,'rb')
		right = gzip.open(args.right_input,'rb')
		fout = gzip.open(args.output,'wb')

	else:
		left = open(args.left_input,'rU')
		right = open(args.right_input,'rU')
		fout = open(args.output,'wb')


	# USING A WHILE LOOP MAKES THIS SUPER FAST
	# Details here: 
	#   http://effbot.org/zone/readline-performance.htm
	
	while 1: 

		# process the first file
		left_line = left.readline()
		if not left_line: break
		fout.write(left_line)

		left_line = left.readline()
		fout.write(left_line)

		left_line = left.readline()
		fout.write(left_line)

		left_line = left.readline()
		fout.write(left_line)


		# process the second file
		right_line = right.readline()
		fout.write(right_line)

		right_line = right.readline()
		fout.write(right_line)

		right_line = right.readline()
		fout.write(right_line)

		right_line = right.readline()
		fout.write(right_line)

	left.close()
	right.close()
	fout.close()
	return 0

if __name__ == '__main__':
	args = interface()
	process_reads(args)</code></pre></noscript>
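<p>The script above trusts that both files hold the same number of complete records. If you would rather fail loudly on mismatched inputs, a record-based variant could look like this sketch. It assumes Python 3 and text-mode file handles, and the function names <code>read_record</code> and <code>interleave</code> are made up for the example.</p>

```python
from itertools import islice


def read_record(handle):
    """Return the next 4-line fastq record as a list, or [] at end of file."""
    record = list(islice(handle, 4))
    if record and len(record) != 4:
        raise ValueError("truncated fastq record")
    return record


def interleave(left, right, out):
    """Write records alternately from left and right; return the pair count."""
    pairs = 0
    while True:
        left_rec, right_rec = read_record(left), read_record(right)
        if not left_rec and not right_rec:
            return pairs
        if not left_rec or not right_rec:
            raise ValueError("input files have different numbers of reads")
        out.writelines(left_rec + right_rec)
        pairs += 1
```

<p>For gzipped files, open the handles with <code>gzip.open(path, 'rt')</code> and <code>gzip.open(path, 'wt')</code> so that they yield and accept text lines.</p>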
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bowtie2 output as BAM</title>
		<link>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/</link>
		<comments>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/#comments</comments>
		<pubDate>Wed, 14 Mar 2012 14:01:46 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[bowtie2]]></category>
		<category><![CDATA[pipe]]></category>
		<category><![CDATA[samtools]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=442</guid>
		<description><![CDATA[Bowtie2 is a short read aligner that is optimized for aligning reads of 50 bp or longer. I&#8217;ve been playing around with it and was initially puzzled by the fact that it only outputs SAM-formatted alignments. Then I realized you can pipe the output straight into samtools which will do the [...]]]></description>
				<content:encoded><![CDATA[<p><a title="bowtie2" href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml">Bowtie2</a> is a short read aligner that is optimized for aligning reads of 50 bp or longer. I&#8217;ve been playing around with it and was initially puzzled by the fact that it only outputs <a title="SAM format spec" href="http://samtools.sourceforge.net/samtools.shtml#5" target="_blank">SAM-formatted</a> alignments. Then I realized you can pipe the output straight into <a title="samtools" href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">samtools</a>, which will do the compression to BAM for you.</p>


<div class="wp-geshi-highlight-wrap5"><div class="wp-geshi-highlight-wrap4"><div class="wp-geshi-highlight-wrap3"><div class="wp-geshi-highlight-wrap2"><div class="wp-geshi-highlight-wrap"><div class="wp-geshi-highlight"><div class="sh"><pre class="de1">$ bowtie2 \
-p 4 \
-x /genome/index \
-1 pair1.fastq \
-2 pair2.fastq \
-U unpaired.fastq \
--very-sensitive \
-X 1000 \
-I 200 \
| samtools view -bS - &gt; output.bam</pre></div></div></div></div></div></div></div>
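<p>If you are driving the aligner from a Python script rather than the shell, the same pipe can be built with the subprocess module. This is a Python 3 sketch; <code>pipe_to_file</code> and <code>bowtie2_to_bam</code> are hypothetical helper names, and it assumes bowtie2 and samtools are on your PATH.</p>

```python
import subprocess


def pipe_to_file(producer_cmd, consumer_cmd, out_path):
    """Run producer_cmd | consumer_cmd > out_path without an intermediate file."""
    with open(out_path, "wb") as out:
        producer = subprocess.Popen(producer_cmd, stdout=subprocess.PIPE)
        consumer = subprocess.Popen(consumer_cmd, stdin=producer.stdout, stdout=out)
        # Close our copy of the pipe so the producer gets SIGPIPE
        # if the consumer exits early.
        producer.stdout.close()
        consumer.communicate()
        producer.wait()
    if producer.returncode != 0 or consumer.returncode != 0:
        raise RuntimeError("one of the piped commands exited with an error")


def bowtie2_to_bam(bowtie2_args, bam_path):
    """Align with bowtie2 and let samtools compress the SAM stream to BAM."""
    pipe_to_file(["bowtie2"] + list(bowtie2_args),
                 ["samtools", "view", "-bS", "-"],
                 bam_path)
```

<p>For example, <code>bowtie2_to_bam(['-p', '4', '-x', '/genome/index', '-U', 'unpaired.fastq'], 'output.bam')</code> mirrors a cut-down version of the command above.</p>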


]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/feed/</wfw:commentRss>
		<slash:comments>3</slash:comments>
		</item>
		<item>
		<title>Installing ABySS</title>
		<link>http://www.ngcrawford.com/2010/12/05/installing-abyss/</link>
		<comments>http://www.ngcrawford.com/2010/12/05/installing-abyss/#comments</comments>
		<pubDate>Sun, 05 Dec 2010 19:08:08 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[ABySS]]></category>
		<category><![CDATA[Installation]]></category>
		<category><![CDATA[Instructions]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=344</guid>
		<description><![CDATA[To install ABySS on a system running an older version of gcc, use the following commands. &#62;./configure --enable-maxk=96 --disable-openmp \ CPPFLAGS=-I&#60;path to google-sparsehash install&#62;/include \ --prefix=&#60;ABySS install directory&#62;/ &#62;make AM_CXXFLAGS=-Wall &#62;make install]]></description>
				<content:encoded><![CDATA[<p>To install <a href="http://www.bcgsc.ca/platform/bioinfo/software/abyss">ABySS</a> on a system running an older version of gcc, use the following commands.</p>
<blockquote><p>&gt;./configure --enable-maxk=96 --disable-openmp \</p>
<p>CPPFLAGS=-I&lt;path to <a href="http://code.google.com/p/google-sparsehash/">google-sparsehash</a> install&gt;/include \</p>
<p>--prefix=&lt;ABySS install directory&gt;/</p>
<p>&gt;make AM_CXXFLAGS=-Wall</p>
<p>&gt;make install</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/12/05/installing-abyss/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Word by word in Terminal</title>
		<link>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/</link>
		<comments>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/#comments</comments>
		<pubDate>Mon, 27 Sep 2010 15:33:42 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=339</guid>
		<description><![CDATA[One of the annoying things in the OS X Terminal is that if you want to traverse word by word in a line of text you need to type &#8216;esc-b&#8217; and &#8216;esc-f&#8217;. This post on macromates, the textmate blog, explains how to reset these keys. Enjoy.]]></description>
				<content:encoded><![CDATA[<p>One of the annoying things in the OS X Terminal is that if you want to traverse word by word in a line of text you need to type &#8216;esc-b&#8217; and &#8216;esc-f&#8217;. This post on <a href="http://blog.macromates.com/2006/word-movement-in-terminal/">macromates</a>, the textmate blog, explains how to reset these keys. Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MultiMarkdown</title>
		<link>http://www.ngcrawford.com/2010/06/09/multimarkdown/</link>
		<comments>http://www.ngcrawford.com/2010/06/09/multimarkdown/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 15:56:58 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=336</guid>
		<description><![CDATA[Since I started using github in a serious way back in January I&#8217;ve begun writing my documentation in the  markdown format that displays so nicely on github. Markdown is essentially a parsing tool and a simple text syntax that allows the easy conversion of human &#8216;readable text&#8217; to html. It&#8217;s intuitive, it took less than 5 minutes to pick up, [...]]]></description>
				<content:encoded><![CDATA[<p>Since I started using <a href="http://www.github.com">github</a> in a serious way back in January I&#8217;ve begun writing my documentation in the <a href="http://daringfireball.net/projects/markdown/">markdown</a> format that displays so nicely on github. Markdown is essentially a parsing tool and a simple text syntax that allows the easy conversion of human &#8216;readable text&#8217; to html. It&#8217;s intuitive, it took less than 5 minutes to pick up, and it saves me a ton of time not writing HTML. However, its ease of use is tempered, a bit, by a lack of features. Although it is easy to create headers, lists, and code blocks &#8211; simple HTML stuff &#8211; it doesn&#8217;t include the option to create tables, formatted mathematical formulas, citations and bibliographies. Since I&#8217;m a scientist who wants to produce documents with these sorts of features, this is annoying.</p>
<p>Luckily, the markdown syntax has recently been extended, in a project called <a href="http://fletcherpenney.net/multimarkdown/">MultiMarkdown</a>, to include many of the aforementioned features. MultiMarkdown essentially merges the markdown syntax with <a href="http://www.latex-project.org/">LaTeX</a> which, if you haven&#8217;t heard of it, is a rather inscrutable, but extremely powerful text formatting language. It&#8217;s popular in the CS and physics disciplines. LaTeX produces beautiful documents, but it&#8217;s easy to spend a week or more adjusting the formatting and reading the documentation trying to figure out some of the more complicated features. MultiMarkdown looks like it will do much of the more basic LaTeX formatting, but without the headache.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/06/09/multimarkdown/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vertebrate Zoology: Bi302</title>
		<link>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/</link>
		<comments>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 14:21:11 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Courses]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=329</guid>
		<description><![CDATA[Welcome to Vertebrate Zoology lab (Bi302 Lab).  I&#8217;ll be using this portion of my website to post notes/slides and to answer your questions.  Please make use of the comments. More to come.]]></description>
				<content:encoded><![CDATA[<p>Welcome to Vertebrate Zoology lab (Bi302 Lab).  I&#8217;ll be using this portion of my website to post notes/slides and to answer your questions.  Please make use of the comments. More to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>
