<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:wfw="http://wellformedweb.org/CommentAPI/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
	xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
	>

<channel>
	<title>Nick Crawford</title>
	<atom:link href="http://www.ngcrawford.com/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.ngcrawford.com</link>
	<description>Evolution and more...</description>
	<lastBuildDate>Thu, 17 May 2012 11:43:32 +0000</lastBuildDate>
	<language>en</language>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
	<generator>http://wordpress.org/?v=3.3.2</generator>
		<item>
		<title>Python: Adding Read Group (@RG) tags to BAM or SAM files</title>
		<link>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/</link>
		<comments>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/#comments</comments>
		<pubDate>Tue, 17 Apr 2012 16:44:11 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>
		<category><![CDATA[@RG]]></category>
		<category><![CDATA[BAM]]></category>
		<category><![CDATA[GATK]]></category>
		<category><![CDATA[pysam]]></category>
		<category><![CDATA[SAM]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=506</guid>
		<description><![CDATA[The SAM specification now requires @RG tags to be included in all SAM/BAM alignments. If you are using GATK, you have probably noticed that it will not run without them. Since @RG tags weren&#8217;t standard until recently, I&#8217;ve written a script to add them post hoc. You&#8217;ll need to install pysam and Python 2.7 to get [...]]]></description>
			<content:encoded><![CDATA[<p>The <a title="Link to PDF" href="http://samtools.sourceforge.net/SAM1.pdf" target="_blank">SAM specification</a> now requires @RG tags to be included in all SAM/BAM alignments. If you are using <a title="GATK" href="http://www.broadinstitute.org/gsa/wiki/index.php/Home_Page" target="_blank">GATK</a>, you have probably noticed that it will not run without them. Since @RG tags weren&#8217;t standard until recently, I&#8217;ve written a script to add them post hoc. You&#8217;ll need to install <a title="pysam" href="http://code.google.com/p/pysam/" target="_blank">pysam</a> and <a title="Python2.7" href="http://www.python.org/getit/releases/2.7/" target="_blank">Python 2.7</a> to get it to work.</p>
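<p>For orientation, here is a minimal sketch of what an @RG header line looks like as plain text. It uses only the standard library, not pysam. The helper name <code>format_rg_line</code> and the sample values (<code>sampleA</code>, <code>FC123.1</code>) are made up for illustration and are not part of the script itself.</p>

```python
def format_rg_line(read_group):
    """Render a read-group dict as a SAM @RG header line (hypothetical helper)."""
    if not read_group.get('ID'):
        raise ValueError('an @RG line needs an ID field')
    # ID is written first here; the other tags follow in sorted order.
    fields = ['ID:{0}'.format(read_group['ID'])]
    for tag in sorted(read_group):
        value = read_group[tag]
        # Skip ID (already written) and any tag left empty.
        if tag != 'ID' and value not in ('', None):
            fields.append('{0}:{1}'.format(tag, value))
    # SAM header fields are separated by tabs.
    return '\t'.join(['@RG'] + fields)


# Illustrative values only; a real run would take these from the file name and reads.
example = {'ID': 'sampleA', 'SM': 'sampleA', 'LB': 'sampleA',
           'PL': 'ILLUMINA', 'PU': 'FC123.1', 'CN': ''}
print(format_rg_line(example))
```

<p>Each read in the file then carries a matching <code>RG:Z:sampleA</code> tag that points back to the ID of this header line.</p>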
<div id="gist-2407317" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="kn">import</span> <span class="nn">os</span></div><div class='line' id='LC2'><span class="kn">import</span> <span class="nn">sys</span></div><div class='line' id='LC3'><span class="kn">import</span> <span class="nn">glob</span></div><div class='line' id='LC4'><span class="kn">import</span> <span class="nn">pysam</span></div><div class='line' id='LC5'><span class="kn">import</span> <span class="nn">argparse</span></div><div class='line' id='LC6'><span class="kn">import</span> <span class="nn">multiprocessing</span></div><div class='line' id='LC7'><br/></div><div class='line' id='LC8'><span class="k">def</span> <span class="nf">get_args</span><span class="p">():</span></div><div class='line' id='LC9'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="sd">&#39;&#39;&#39;Parse sys.argv&#39;&#39;&#39;</span></div><div class='line' id='LC10'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span></div><div class='line' id='LC11'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;--cores&#39;</span><span class="p">,</span> <span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span></div><div class='line' id='LC12'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&#39;the number of available processor cores&#39;</span><span class="p">)</span></div><div class='line' id='LC13'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-i&#39;</span><span 
class="p">,</span><span class="s">&#39;--input-dir&#39;</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span></div><div class='line' id='LC14'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&#39;The input directory containing the bam files.&#39;</span><span class="p">)</span></div><div class='line' id='LC15'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-CN&#39;</span><span class="p">,</span><span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span></div><div class='line' id='LC16'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&quot;Name of sequencing center producing the read. </span><span class="se">\</span></div><div class='line' id='LC17'><span class="s">                        GATK not required.&quot;</span><span class="p">)</span></div><div class='line' id='LC18'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-DS&#39;</span><span class="p">,</span><span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span></div><div class='line' id='LC19'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&quot;Description. GATK Not Required. 
&quot;</span><span class="p">)</span></div><div class='line' id='LC20'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-DT&#39;</span><span class="p">,</span><span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span></div><div class='line' id='LC21'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&quot;Date the run was produced (ISO8601 date or date/time). </span><span class="se">\</span></div><div class='line' id='LC22'><span class="s">                        GATK Not Required. &quot;</span><span class="p">)</span></div><div class='line' id='LC23'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-PI&#39;</span><span class="p">,</span><span class="nb">type</span><span class="o">=</span><span class="nb">int</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span></div><div class='line' id='LC24'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&quot;Predicted median insert size. 
GATK Not Required.&quot;</span><span class="p">)</span></div><div class='line' id='LC25'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">parser</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-PL&#39;</span><span class="p">,</span><span class="nb">type</span><span class="o">=</span><span class="nb">str</span><span class="p">,</span> <span class="n">required</span><span class="o">=</span><span class="bp">True</span><span class="p">,</span></div><div class='line' id='LC26'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">choices</span> <span class="o">=</span> <span class="p">[</span><span class="s">&#39;CAPILLARY&#39;</span><span class="p">,</span> <span class="s">&#39;LS454&#39;</span><span class="p">,</span> <span class="s">&#39;ILLUMINA&#39;</span><span class="p">,</span> <span class="s">&#39;SOLID&#39;</span><span class="p">,</span> </div><div class='line' id='LC27'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;HELICOS&#39;</span><span class="p">,</span> <span class="s">&#39;IONTORRENT&#39;</span><span class="p">,</span> <span class="s">&#39;PACBIO&#39;</span><span class="p">],</span></div><div class='line' id='LC28'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">help</span><span class="o">=</span><span class="s">&quot;Platform/technology used to produce the reads.&quot;</span><span class="p">)</span></div><div class='line' id='LC29'><br/></div><div class='line' id='LC30'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">args</span> <span class="o">=</span> <span class="n">parser</span><span 
class="o">.</span><span class="n">parse_args</span><span class="p">()</span></div><div class='line' id='LC31'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">return</span> <span class="n">args</span></div><div class='line' id='LC32'><br/></div><div class='line' id='LC33'><span class="k">def</span> <span class="nf">addRG2Header</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">sam_info</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span></div><div class='line' id='LC34'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="sd">&quot;&quot;&quot;Add read group info to a header.&quot;&quot;&quot;</span></div><div class='line' id='LC35'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># CREATE TEMPLATE</span></div><div class='line' id='LC36'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># Read group. Unordered multiple @RG lines are allowed.</span></div><div class='line' id='LC37'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span> <span class="o">=</span> <span class="p">{</span> <span class="s">&#39;ID&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># Read group identifier. e.g., Illumina flowcell + lane name and number</span></div><div class='line' id='LC38'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;CN&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># GATK Not Required. Name of sequencing center producing the read.</span></div><div class='line' id='LC39'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;DS&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># GATK Not Required. 
Description</span></div><div class='line' id='LC40'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;DT&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># GATK Not Required. Date the run was produced (ISO8601 date YYYY-MM-DD or YYYYMMDD)</span></div><div class='line' id='LC41'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;PI&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># GATK Not Required. Predicted median insert size.</span></div><div class='line' id='LC42'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;PU&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># GATK Not Required. Platform unit (e.g. flowcell-barcode.lane for Illumina or slide for SOLiD).</span></div><div class='line' id='LC43'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;SM&#39;</span><span class="p">:</span> <span class="s">&#39;&#39;</span><span class="p">,</span>           <span class="c"># Sample. 
Use pool name where a pool is being sequenced.</span></div><div class='line' id='LC44'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;PL&#39;</span><span class="p">:</span> <span class="s">&#39;ILLUMINA&#39;</span><span class="p">}</span>   <span class="c"># Platform/technology used to produce the reads.</span></div><div class='line' id='LC45'><br/></div><div class='line' id='LC46'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">samfile</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="s">&#39;r&#39;</span><span class="p">)</span></div><div class='line' id='LC47'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_header</span> <span class="o">=</span> <span class="n">samfile</span><span class="o">.</span><span class="n">header</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span></div><div class='line' id='LC48'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">samfile</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC49'><br/></div><div class='line' id='LC50'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># ADD INFO TO TEMPLATE</span></div><div class='line' id='LC51'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span> <span class="o">=</span> <span class="n">RG_template</span><span class="o">.</span><span class="n">copy</span><span class="p">()</span></div><div class='line' id='LC52'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;ID&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">sam_info</span><span class="p">[</span><span class="s">&quot;sample_name&quot;</span><span class="p">]</span></div><div class='line' id='LC53'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">if</span> 
<span class="n">args</span><span class="o">.</span><span class="n">CN</span><span class="p">:</span> <span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;CN&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">CN</span><span class="o">.</span><span class="n">upper</span><span class="p">()</span></div><div class='line' id='LC54'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">DS</span><span class="p">:</span> <span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;DS&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">DS</span></div><div class='line' id='LC55'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;DT&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">DT</span></div><div class='line' id='LC56'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;LB&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">sam_info</span><span class="p">[</span><span class="s">&quot;sample_name&quot;</span><span class="p">]</span></div><div class='line' id='LC57'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;SM&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">sam_info</span><span class="p">[</span><span class="s">&quot;sample_name&quot;</span><span class="p">]</span></div><div class='line' id='LC58'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;DS&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&quot;{0}.{1}&quot;</span><span 
class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">sam_info</span><span class="p">[</span><span class="s">&#39;sample_name&#39;</span><span class="p">],</span> <span class="n">sam_info</span><span class="p">[</span><span class="s">&#39;locality&#39;</span><span class="p">])</span></div><div class='line' id='LC59'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;PI&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">PI</span></div><div class='line' id='LC60'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">RG_template</span><span class="p">[</span><span class="s">&#39;PU&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="s">&#39;{0}.{1}&#39;</span><span class="o">.</span><span class="n">format</span><span class="p">(</span><span class="n">sam_info</span><span class="p">[</span><span class="s">&#39;flowcell_id&#39;</span><span class="p">],</span> <span class="n">sam_info</span><span class="p">[</span><span class="s">&#39;lane&#39;</span><span class="p">])</span></div><div class='line' id='LC61'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_header</span><span class="p">[</span><span class="s">&#39;RG&#39;</span><span class="p">]</span> <span class="o">=</span> <span class="p">[</span><span class="n">RG_template</span><span class="p">]</span></div><div class='line' id='LC62'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">return</span> <span class="n">new_header</span></div><div class='line' id='LC63'><br/></div><div class='line' id='LC64'><br/></div><div class='line' id='LC65'><span class="k">def</span> <span class="nf">add_RGs_2_BAMs_runner</span><span class="p">(</span><span class="n">data</span><span class="p">):</span></div><div class='line' id='LC66'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="sd">&quot;&quot;&quot;Generates the correct @RG header and adds a RG field to a bam 
file.&quot;&quot;&quot;</span></div><div class='line' id='LC67'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># Make Bam of Sam if it doesn&#39;t exist</span></div><div class='line' id='LC68'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">filename</span><span class="p">,</span> <span class="n">new_RG_header</span> <span class="o">=</span> <span class="n">data</span></div><div class='line' id='LC69'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">if</span> <span class="n">filename</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">&#39;sam&#39;</span><span class="p">):</span></div><div class='line' id='LC70'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sam_handle</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span></div><div class='line' id='LC71'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">bam_name</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="s">&quot;.bam&quot;</span></div><div class='line' id='LC72'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">bam_handle</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span> <span class="n">bam_name</span><span class="p">,</span> <span class="s">&quot;wb&quot;</span><span class="p">,</span> <span class="n">template</span> <span class="o">=</span> <span class="n">sam_handle</span> <span class="p">)</span></div><div class='line' id='LC73'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">for</span> <span class="n">s</span> <span 
class="ow">in</span> <span class="n">sam_handle</span><span class="p">:</span></div><div class='line' id='LC74'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">bam_handle</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">s</span><span class="p">)</span></div><div class='line' id='LC75'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">filename</span> <span class="o">=</span> <span class="n">bam_name</span></div><div class='line' id='LC76'><br/></div><div class='line' id='LC77'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># Massage paths and make outputfiles</span></div><div class='line' id='LC78'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">pysam</span><span class="o">.</span><span class="n">sort</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="s">&quot;.sorted&quot;</span><span class="p">)</span></div><div class='line' id='LC79'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">pysam</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="s">&quot;.sorted.bam&quot;</span><span class="p">)</span></div><div class='line' id='LC80'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">filename</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span 
class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">0</span><span class="p">]</span><span class="o">+</span><span class="s">&quot;.sorted.bam&quot;</span></div><div class='line' id='LC81'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">path</span><span class="p">,</span> <span class="n">filename</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span></div><div class='line' id='LC82'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">name</span><span class="p">,</span> <span class="n">ext</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">splitext</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span></div><div class='line' id='LC83'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_name</span> <span class="o">=</span> <span class="n">name</span> <span class="o">+</span> <span class="s">&#39;.wRG.&#39;</span> <span class="o">+</span> <span class="s">&#39;bam&#39;</span></div><div class='line' id='LC84'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">outfile_name</span> <span class="o">=</span>  <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span><span class="n">new_name</span><span class="p">)</span></div><div class='line' id='LC85'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">outfile</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span> <span class="n">outfile_name</span><span class="p">,</span> <span class="s">&#39;wb&#39;</span><span class="p">,</span> <span class="n">header</span> <span 
class="o">=</span> <span class="n">new_RG_header</span> <span class="p">)</span></div><div class='line' id='LC86'><br/></div><div class='line' id='LC87'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># Step 2: Process Samfile adding Read Group to Each Read</span></div><div class='line' id='LC88'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">samfile</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">path</span><span class="p">,</span> <span class="n">filename</span><span class="p">))</span></div><div class='line' id='LC89'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">samfile</span><span class="o">.</span><span class="n">fetch</span><span class="p">()</span></div><div class='line' id='LC90'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">for</span> <span class="n">count</span><span class="p">,</span> <span class="n">read</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">samfile</span><span class="o">.</span><span class="n">fetch</span><span class="p">()):</span></div><div class='line' id='LC91'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">name</span> <span class="o">=</span> <span class="n">read</span><span class="o">.</span><span class="n">qname</span></div><div class='line' id='LC92'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">read_group</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">filename</span><span class="p">)[</span><span class="mi">1</span><span class="p">]</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">&quot;.&quot;</span><span 
class="p">)[</span><span class="mi">0</span><span class="p">]</span></div><div class='line' id='LC93'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_tags</span> <span class="o">=</span> <span class="n">read</span><span class="o">.</span><span class="n">tags</span></div><div class='line' id='LC94'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_tags</span><span class="o">.</span><span class="n">append</span><span class="p">((</span><span class="s">&#39;RG&#39;</span><span class="p">,</span> <span class="n">read_group</span><span class="p">))</span></div><div class='line' id='LC95'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">read</span><span class="o">.</span><span class="n">tags</span> <span class="o">=</span> <span class="n">new_tags</span></div><div class='line' id='LC96'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">outfile</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">read</span><span class="p">)</span></div><div class='line' id='LC97'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">outfile</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC98'><br/></div><div class='line' id='LC99'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="c"># Step 3: Make index of read group enabled samfile</span></div><div class='line' id='LC100'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">pysam</span><span class="o">.</span><span class="n">index</span><span class="p">(</span><span class="n">outfile_name</span><span class="p">)</span></div><div class='line' id='LC101'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">&quot;.&quot;</span><span class="p">)</span></div><div class='line' id='LC102'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sys</span><span class="o">.</span><span 
class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span></div><div class='line' id='LC103'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">return</span></div><div class='line' id='LC104'><br/></div><div class='line' id='LC105'><span class="k">def</span> <span class="nf">parseFileName</span><span class="p">(</span><span class="n">filepath</span><span class="p">):</span></div><div class='line' id='LC106'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">path</span><span class="p">,</span> <span class="n">filename</span> <span class="o">=</span> <span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="n">filepath</span><span class="p">)</span></div><div class='line' id='LC107'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sample_name</span><span class="p">,</span> <span class="n">locality</span><span class="p">,</span> <span class="n">inline_tag</span><span class="p">,</span> <span class="n">third_read_tag</span> <span class="o">=</span> <span class="n">filename</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">&quot;.&quot;</span><span class="p">)[:</span><span class="o">-</span><span class="mi">1</span><span class="p">]</span></div><div class='line' id='LC108'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sam_handle</span> <span class="o">=</span> <span class="n">pysam</span><span class="o">.</span><span class="n">Samfile</span><span class="p">(</span><span class="n">filepath</span><span class="p">,</span><span class="s">&#39;r&#39;</span><span class="p">)</span></div><div class='line' id='LC109'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sam_line</span> <span class="o">=</span> <span class="n">sam_handle</span><span class="o">.</span><span class="n">next</span><span class="p">()</span></div><div class='line' id='LC110'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">read_info</span> <span class="o">=</span> 
<span class="n">sam_line</span><span class="o">.</span><span class="n">qname</span></div><div class='line' id='LC111'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sam_handle</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC112'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">instrument</span><span class="p">,</span> <span class="n">run_id</span><span class="p">,</span> <span class="n">flowcell_id</span><span class="p">,</span> <span class="n">lane</span> <span class="o">=</span> <span class="n">read_info</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">&quot;:&quot;</span><span class="p">)[:</span><span class="mi">4</span><span class="p">]</span></div><div class='line' id='LC113'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">info</span> <span class="o">=</span> <span class="p">{</span><span class="s">&#39;sample_name&#39;</span><span class="p">:</span> <span class="n">sample_name</span><span class="p">,</span></div><div class='line' id='LC114'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;locality&#39;</span><span class="p">:</span> <span class="n">locality</span><span class="p">,</span></div><div class='line' id='LC115'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;inline_tag&#39;</span><span class="p">:</span> <span class="n">inline_tag</span><span class="p">,</span></div><div class='line' id='LC116'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;third_read_tag&#39;</span><span class="p">:</span> <span class="n">third_read_tag</span><span class="p">,</span></div><div class='line' id='LC117'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;instrument&#39;</span><span class="p">:</span><span class="n">instrument</span><span class="p">,</span>  </div><div class='line' 
id='LC118'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;run_id&#39;</span><span class="p">:</span> <span class="n">run_id</span><span class="p">,</span></div><div class='line' id='LC119'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;flowcell_id&#39;</span><span class="p">:</span> <span class="n">flowcell_id</span><span class="p">,</span></div><div class='line' id='LC120'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="s">&#39;lane&#39;</span><span class="p">:</span> <span class="n">lane</span><span class="p">}</span></div><div class='line' id='LC121'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">return</span> <span class="n">info</span></div><div class='line' id='LC122'><br/></div><div class='line' id='LC123'><span class="k">def</span> <span class="nf">add_RGs_2_BAMs</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">args</span><span class="p">):</span></div><div class='line' id='LC124'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">data_for_map</span> <span class="o">=</span> <span class="p">[]</span></div><div class='line' id='LC125'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">print</span> <span class="s">&#39;Making RG headers.&#39;</span></div><div class='line' id='LC126'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">for</span> <span class="n">count</span><span class="p">,</span> <span class="n">filename</span> <span class="ow">in</span> <span class="nb">enumerate</span><span class="p">(</span><span class="n">glob</span><span class="o">.</span><span class="n">glob</span><span class="p">(</span><span class="n">os</span><span class="o">.</span><span class="n">path</span><span class="o">.</span><span class="n">join</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">input_dir</span><span class="p">,</span><span class="s">&#39;*&#39;</span><span 
class="p">))):</span></div><div class='line' id='LC127'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">if</span> <span class="n">filename</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">&#39;sam&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="n">filename</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">&#39;bam&#39;</span><span class="p">):</span></div><div class='line' id='LC128'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sam_info</span> <span class="o">=</span> <span class="n">parseFileName</span><span class="p">(</span><span class="n">filename</span><span class="p">)</span></div><div class='line' id='LC129'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">new_RG_header</span> <span class="o">=</span> <span class="n">addRG2Header</span><span class="p">(</span><span class="n">filename</span><span class="p">,</span> <span class="n">sam_info</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span></div><div class='line' id='LC130'>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">data_for_map</span><span class="o">.</span><span class="n">append</span><span class="p">([</span><span class="n">filename</span><span class="p">,</span> <span class="n">new_RG_header</span><span class="p">])</span></div><div class='line' id='LC131'><br/></div><div class='line' id='LC132'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sys</span><span class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="s">&quot;</span><span class="se">\n</span><span class="s">Adding RGs and making BAMs&quot;</span><span class="p">)</span></div><div class='line' id='LC133'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">sys</span><span 
class="o">.</span><span class="n">stdout</span><span class="o">.</span><span class="n">flush</span><span class="p">()</span></div><div class='line' id='LC134'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">pool</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">add_RGs_2_BAMs_runner</span><span class="p">,</span> <span class="n">data_for_map</span><span class="p">)</span></div><div class='line' id='LC135'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="k">return</span></div><div class='line' id='LC136'><br/></div><div class='line' id='LC137'><span class="k">def</span> <span class="nf">main</span><span class="p">():</span></div><div class='line' id='LC138'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">args</span> <span class="o">=</span> <span class="n">get_args</span><span class="p">()</span></div><div class='line' id='LC139'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">cores</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">cores</span></div><div class='line' id='LC140'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">pool</span> <span class="o">=</span> <span class="n">multiprocessing</span><span class="o">.</span><span class="n">Pool</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">cores</span><span class="p">)</span></div><div class='line' id='LC141'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">add_RGs_2_BAMs</span><span class="p">(</span><span class="n">pool</span><span class="p">,</span> <span class="n">args</span><span class="p">)</span></div><div class='line' id='LC142'><br/></div><div class='line' id='LC143'><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">&#39;__main__&#39;</span><span class="p">:</span></div><div class='line' id='LC144'>&nbsp;&nbsp;&nbsp;&nbsp;<span class="n">main</span><span class="p">()</span></div><div class='line' id='LC145'><br/></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/2407317/6512ce8321e24690839e836c00c916505a6943b0/addReadGroup2BAMs.py" style="float:right;">view raw</a>
            <a href="https://gist.github.com/2407317#file_add_read_group2_ba_ms.py" style="float:right;margin-right:10px;color:#666">addReadGroup2BAMs.py</a>
            <a href="https://gist.github.com/2407317">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
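<p>The file-name and read-name parsing done by <code>parseFileName</code> can be tried out without pysam. What follows is a rough Python 3 sketch, not the script itself. Like the function above, it assumes file names of the form <code>sample.locality.inline_tag.third_read_tag.bam</code> and read names whose first four colon-separated fields are instrument, run ID, flowcell ID and lane. The names <code>parse_names</code>, <code>make_rg</code> and <code>format_rg</code>, the example values, and the mapping onto <code>@RG</code> fields are hypothetical choices for illustration.</p>

```python
import os


def parse_names(filepath, first_read_name):
    """Split a file name and a read name into the pieces used for an @RG tag.

    Assumes 'sample.locality.inline_tag.third_read_tag.ext' file names and
    read names that start with 'instrument:run_id:flowcell_id:lane:'.
    """
    filename = os.path.basename(filepath)
    parts = filename.split(".")[:-1]  # drop the extension
    if len(parts) != 4:
        raise ValueError("expected 4 dot-separated fields in %r" % filename)
    sample_name, locality, inline_tag, third_read_tag = parts

    fields = first_read_name.split(":")[:4]
    if len(fields) != 4:
        raise ValueError("expected 4 colon-separated fields in %r" % first_read_name)
    instrument, run_id, flowcell_id, lane = fields

    return {"sample_name": sample_name, "locality": locality,
            "inline_tag": inline_tag, "third_read_tag": third_read_tag,
            "instrument": instrument, "run_id": run_id,
            "flowcell_id": flowcell_id, "lane": lane}


def make_rg(info, platform="ILLUMINA"):
    """Build one @RG record as a plain dict (hypothetical field mapping)."""
    unit = "%s.%s" % (info["flowcell_id"], info["lane"])
    return {"ID": unit,
            "SM": info["sample_name"],
            "LB": info["inline_tag"],
            "PU": "%s.%s" % (unit, info["third_read_tag"]),
            "PL": platform}


def format_rg(rg):
    """Render the dict as a tab-separated @RG header line."""
    order = ["ID", "SM", "LB", "PU", "PL"]
    return "@RG\t" + "\t".join("%s:%s" % (key, rg[key]) for key in order)


if __name__ == "__main__":
    info = parse_names("/data/frog01.siteA.ACGT.TTAGGC.bam",
                       "HWI-ST1234:55:C0ABCACXX:3:1101:1234:5678")
    print(format_rg(make_rg(info)))
```

<p>Raising an error when a name does not have the expected number of fields makes a badly named file fail early with a readable message, rather than with an unpacking error deep in the run.</p>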

]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/04/17/python-adding-read-group-rg-tags-to-bam-or-sam-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python: Multiprocessing large files</title>
		<link>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/</link>
		<comments>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 13:08:43 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[big data]]></category>
		<category><![CDATA[MapReduce]]></category>
		<category><![CDATA[multiprocessing]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=466</guid>
		<description><![CDATA[I&#8217;ve been working with a lot of very large files and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce-style algorithm seemed like the way to go, but I had a hard time finding a useful example. [...]]]></description>
			<content:encoded><![CDATA[<p>I&#8217;ve been working with a lot of very large files and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A <a href="http://en.wikipedia.org/wiki/MapReduce" target="_blank">MapReduce</a>-style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.</p>
<p>The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The processed lines are then written to an output file or, in this example, just printed to the screen. Normally this would be tricky code to write, but Python 2.7&#8217;s wonderful <a href="http://docs.python.org/library/multiprocessing.html">multiprocessing module</a> handles all the synchronization for you.</p>
<div id="gist-2237170" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="c">#!/usr/bin/env python</span></div><div class='line' id='LC2'><span class="c"># encoding: utf-8</span></div><div class='line' id='LC3'><br/></div><div class='line' id='LC4'><span class="kn">import</span> <span class="nn">multiprocessing</span></div><div class='line' id='LC5'><span class="kn">from</span> <span class="nn">textwrap</span> <span class="kn">import</span> <span class="n">dedent</span></div><div class='line' id='LC6'><span class="kn">from</span> <span class="nn">itertools</span> <span class="kn">import</span> <span class="n">izip_longest</span></div><div class='line' id='LC7'><br/></div><div class='line' id='LC8'><span class="k">def</span> <span class="nf">process_chunk</span><span class="p">(</span><span class="n">d</span><span class="p">):</span></div><div class='line' id='LC9'>	<span class="sd">&quot;&quot;&quot;Replace this with your own function</span></div><div class='line' id='LC10'><span class="sd">	that processes data one line at a</span></div><div class='line' id='LC11'><span class="sd">	time&quot;&quot;&quot;</span></div><div class='line' id='LC12'><br/></div><div class='line' id='LC13'>	<span class="n">d</span> <span class="o">=</span> <span class="n">d</span><span class="o">.</span><span class="n">strip</span><span class="p">()</span> <span class="o">+</span> <span class="s">&#39; processed&#39;</span></div><div class='line' id='LC14'>	<span class="k">return</span> <span class="n">d</span> </div><div class='line' id='LC15'><br/></div><div class='line' id='LC16'><span class="k">def</span> <span class="nf">grouper</span><span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">iterable</span><span class="p">,</span> <span class="n">padvalue</span><span class="o">=</span><span class="bp">None</span><span class="p">):</span></div><div class='line' id='LC17'>	<span class="sd">&quot;&quot;&quot;grouper(3, &#39;abcdefg&#39;, 
&#39;x&#39;) --&gt;</span></div><div class='line' id='LC18'><span class="sd">	(&#39;a&#39;,&#39;b&#39;,&#39;c&#39;), (&#39;d&#39;,&#39;e&#39;,&#39;f&#39;), (&#39;g&#39;,&#39;x&#39;,&#39;x&#39;)&quot;&quot;&quot;</span></div><div class='line' id='LC19'><br/></div><div class='line' id='LC20'>	<span class="k">return</span> <span class="n">izip_longest</span><span class="p">(</span><span class="o">*</span><span class="p">[</span><span class="nb">iter</span><span class="p">(</span><span class="n">iterable</span><span class="p">)]</span><span class="o">*</span><span class="n">n</span><span class="p">,</span> <span class="n">fillvalue</span><span class="o">=</span><span class="n">padvalue</span><span class="p">)</span></div><div class='line' id='LC21'><br/></div><div class='line' id='LC22'><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">&#39;__main__&#39;</span><span class="p">:</span></div><div class='line' id='LC23'><br/></div><div class='line' id='LC24'>	<span class="c"># test data</span></div><div class='line' id='LC25'>	<span class="n">test_data</span> <span class="o">=</span> <span class="s">&quot;&quot;&quot;</span><span class="se">\</span></div><div class='line' id='LC26'><span class="s">	1 some test garbage</span></div><div class='line' id='LC27'><span class="s">	2 some test garbage</span></div><div class='line' id='LC28'><span class="s">	3 some test garbage</span></div><div class='line' id='LC29'><span class="s">	4 some test garbage</span></div><div class='line' id='LC30'><span class="s">	5 some test garbage</span></div><div class='line' id='LC31'><span class="s">	6 some test garbage</span></div><div class='line' id='LC32'><span class="s">	7 some test garbage</span></div><div class='line' id='LC33'><span class="s">	8 some test garbage</span></div><div class='line' id='LC34'><span class="s">	9 some test garbage</span></div><div class='line' id='LC35'><span class="s">	10 some test garbage</span></div><div 
class='line' id='LC36'><span class="s">	11 some test garbage</span></div><div class='line' id='LC37'><span class="s">	12 some test garbage</span></div><div class='line' id='LC38'><span class="s">	13 some test garbage</span></div><div class='line' id='LC39'><span class="s">	14 some test garbage</span></div><div class='line' id='LC40'><span class="s">	15 some test garbage</span></div><div class='line' id='LC41'><span class="s">	16 some test garbage</span></div><div class='line' id='LC42'><span class="s">	17 some test garbage</span></div><div class='line' id='LC43'><span class="s">	18 some test garbage</span></div><div class='line' id='LC44'><span class="s">	19 some test garbage</span></div><div class='line' id='LC45'><span class="s">	20 some test garbage&quot;&quot;&quot;</span></div><div class='line' id='LC46'>	<span class="n">test_data</span> <span class="o">=</span> <span class="n">dedent</span><span class="p">(</span><span class="n">test_data</span><span class="p">)</span></div><div class='line' id='LC47'>	<span class="n">test_data</span> <span class="o">=</span> <span class="n">test_data</span><span class="o">.</span><span class="n">split</span><span class="p">(</span><span class="s">&quot;</span><span class="se">\n</span><span class="s">&quot;</span><span class="p">)</span></div><div class='line' id='LC48'><br/></div><div class='line' id='LC49'>	<span class="c"># Create pool (p)</span></div><div class='line' id='LC50'>	<span class="n">p</span> <span class="o">=</span> <span class="n">multiprocessing</span><span class="o">.</span><span class="n">Pool</span><span class="p">(</span><span class="mi">4</span><span class="p">)</span></div><div class='line' id='LC51'><br/></div><div class='line' id='LC52'>	<span class="c"># Use &#39;grouper&#39; to split test data into</span></div><div class='line' id='LC53'>	<span class="c"># groups you can process without using a</span></div><div class='line' id='LC54'>	<span class="c"># ton of RAM. 
You&#39;ll probably want to </span></div><div class='line' id='LC55'>	<span class="c"># increase the chunk size considerably</span></div><div class='line' id='LC56'>	<span class="c"># to something like 1000 lines per core.</span></div><div class='line' id='LC57'><br/></div><div class='line' id='LC58'>	<span class="c"># The idea is that you replace &#39;test_data&#39;</span></div><div class='line' id='LC59'>	<span class="c"># with a file-handle</span></div><div class='line' id='LC60'>	<span class="c"># e.g., test_data = open(&#39;file.txt&#39;,&#39;rU&#39;)</span></div><div class='line' id='LC61'><br/></div><div class='line' id='LC62'>	<span class="c"># And, you&#39;d write to a file instead of</span></div><div class='line' id='LC63'>	<span class="c"># printing to stdout</span></div><div class='line' id='LC64'><br/></div><div class='line' id='LC65'>	<span class="k">for</span> <span class="n">chunk</span> <span class="ow">in</span> <span class="n">grouper</span><span class="p">(</span><span class="mi">10</span><span class="p">,</span> <span class="n">test_data</span><span class="p">):</span></div><div class='line' id='LC66'>		<span class="n">results</span> <span class="o">=</span> <span class="n">p</span><span class="o">.</span><span class="n">map</span><span class="p">(</span><span class="n">process_chunk</span><span class="p">,</span> <span class="n">chunk</span><span class="p">)</span></div><div class='line' id='LC67'>		<span class="k">for</span> <span class="n">r</span> <span class="ow">in</span> <span class="n">results</span><span class="p">:</span></div><div class='line' id='LC68'>			<span class="k">print</span> <span class="n">r</span> 	<span class="c"># replace with outfile.write()</span></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/2237170/4d6d20c3dfc12b4e52405fa0bb0dfb27a84a39e7/multiprocessing_template.py" style="float:right;">view raw</a>
            <a href="https://gist.github.com/2237170#file_multiprocessing_template.py" style="float:right;margin-right:10px;color:#666">multiprocessing_template.py</a>
            <a href="https://gist.github.com/2237170">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
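<p>For readers on Python 3, the same read-a-chunk, map-it, write-it loop might look roughly like the sketch below. It is an approximation under stated assumptions, not a port of the gist: <code>izip_longest</code> is swapped for a small <code>islice</code>-based helper so the last chunk is simply shorter instead of being padded with <code>None</code>, and the names <code>process_line</code>, <code>chunks</code> and <code>process_lines</code> are made up for the example.</p>

```python
import multiprocessing
from itertools import islice


def process_line(line):
    """Stand-in for real per-line work (same idea as process_chunk above)."""
    return line.strip() + " processed"


def chunks(iterable, size):
    """Yield lists of up to `size` items; the last list may be shorter."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def process_lines(lines, cores=4, chunk_size=1000):
    """Process `lines` one chunk at a time and return results in input order.

    Only one chunk is handed to the pool at a time; for very large inputs,
    write each chunk's results out instead of collecting them in a list.
    """
    results = []
    with multiprocessing.Pool(cores) as pool:
        for chunk in chunks(lines, chunk_size):
            # Pool.map blocks until the whole chunk is done and keeps order.
            results.extend(pool.map(process_line, chunk))
    return results
```

<p>To use it, call <code>process_lines(handle)</code> on an open file handle from under an <code>if __name__ == '__main__':</code> guard, which multiprocessing needs on platforms that start worker processes by re-importing the script.</p>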

]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/29/python-multiprocessing-large-files/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Python: Interleave Paired-End Reads</title>
		<link>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/</link>
		<comments>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/#comments</comments>
		<pubDate>Thu, 29 Mar 2012 03:10:54 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[fastq]]></category>
		<category><![CDATA[next-gen sequencing]]></category>
		<category><![CDATA[python]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=472</guid>
		<description><![CDATA[Here&#8217;s a simple script for interleaving paired-end fastq files. You&#8217;ll need to do this if you want to create input files for velvet. Unlike velvet&#8217;s shuffleSequences_fastq.pl Perl script, this script handles gzipped input and output. It requires Python 2.7.]]></description>
			<content:encoded><![CDATA[<p>Here&#8217;s a simple script for interleaving paired-end fastq files. You&#8217;ll need to do this if you want to create input files for <a title="velvet" href="http://www.ebi.ac.uk/~zerbino/velvet/">velvet</a>. Unlike velvet&#8217;s <em>shuffleSequences_fastq.pl</em> Perl script, this script handles gzipped input and output. It requires <a title="Python 2.7" href="http://www.python.org/getit/releases/2.7/" target="_blank">Python 2.7</a>.</p>
<div id="gist-2232505" class="gist">

        <div class="gist-file">
          <div class="gist-data gist-syntax">
              <div class="highlight"><pre><div class='line' id='LC1'><span class="c">#!/usr/bin/env python</span></div><div class='line' id='LC2'><span class="c"># encoding: utf-8</span></div><div class='line' id='LC3'><br/></div><div class='line' id='LC4'><span class="kn">import</span> <span class="nn">gzip</span></div><div class='line' id='LC5'><span class="kn">import</span> <span class="nn">argparse</span></div><div class='line' id='LC6'><br/></div><div class='line' id='LC7'><span class="k">def</span> <span class="nf">interface</span><span class="p">():</span></div><div class='line' id='LC8'>	<span class="n">args</span> <span class="o">=</span> <span class="n">argparse</span><span class="o">.</span><span class="n">ArgumentParser</span><span class="p">()</span></div><div class='line' id='LC9'>	<span class="n">args</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-l&#39;</span><span class="p">,</span> <span class="s">&#39;--left-input&#39;</span><span class="p">,</span></div><div class='line' id='LC10'>		<span class="n">help</span><span class="o">=</span><span class="s">&#39;The first input file in fastq format.&#39;</span><span class="p">)</span></div><div class='line' id='LC11'><br/></div><div class='line' id='LC12'>	<span class="n">args</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-r&#39;</span><span class="p">,</span> <span class="s">&#39;--right-input&#39;</span><span class="p">,</span> </div><div class='line' id='LC13'>		<span class="n">help</span><span class="o">=</span><span class="s">&#39;The second input file in fastq format.&#39;</span><span class="p">)</span></div><div class='line' id='LC14'><br/></div><div class='line' id='LC15'>	<span class="n">args</span><span class="o">.</span><span class="n">add_argument</span><span class="p">(</span><span class="s">&#39;-o&#39;</span><span class="p">,</span> <span 
class="s">&#39;--output&#39;</span><span class="p">,</span> </div><div class='line' id='LC16'>		<span class="n">help</span><span class="o">=</span><span class="s">&#39;The output file in fastq format.&#39;</span><span class="p">)</span></div><div class='line' id='LC17'><br/></div><div class='line' id='LC18'>	<span class="n">args</span> <span class="o">=</span> <span class="n">args</span><span class="o">.</span><span class="n">parse_args</span><span class="p">()</span></div><div class='line' id='LC19'>	<span class="k">return</span> <span class="n">args</span></div><div class='line' id='LC20'><br/></div><div class='line' id='LC21'><br/></div><div class='line' id='LC22'><span class="k">def</span> <span class="nf">process_reads</span><span class="p">(</span><span class="n">args</span><span class="p">):</span></div><div class='line' id='LC23'><br/></div><div class='line' id='LC24'>	<span class="k">if</span> <span class="n">args</span><span class="o">.</span><span class="n">left_input</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">&#39;.gz&#39;</span><span class="p">)</span> <span class="ow">or</span> <span class="n">args</span><span class="o">.</span><span class="n">right_input</span><span class="o">.</span><span class="n">endswith</span><span class="p">(</span><span class="s">&#39;.gz&#39;</span><span class="p">):</span></div><div class='line' id='LC25'><br/></div><div class='line' id='LC26'>		<span class="n">left</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">left_input</span><span class="p">,</span><span class="s">&#39;rb&#39;</span><span class="p">)</span></div><div class='line' id='LC27'>		<span class="n">right</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span 
class="n">args</span><span class="o">.</span><span class="n">right_input</span><span class="p">,</span><span class="s">&#39;rb&#39;</span><span class="p">)</span></div><div class='line' id='LC28'>		<span class="n">fout</span> <span class="o">=</span> <span class="n">gzip</span><span class="o">.</span><span class="n">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output</span><span class="p">,</span><span class="s">&#39;wb&#39;</span><span class="p">)</span></div><div class='line' id='LC29'><br/></div><div class='line' id='LC30'>	<span class="k">else</span><span class="p">:</span></div><div class='line' id='LC31'>		<span class="n">left</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">left_input</span><span class="p">,</span><span class="s">&#39;rU&#39;</span><span class="p">)</span></div><div class='line' id='LC32'>		<span class="n">right</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">right_input</span><span class="p">,</span><span class="s">&#39;rU&#39;</span><span class="p">)</span></div><div class='line' id='LC33'>		<span class="n">fout</span> <span class="o">=</span> <span class="nb">open</span><span class="p">(</span><span class="n">args</span><span class="o">.</span><span class="n">output</span><span class="p">,</span><span class="s">&#39;wb&#39;</span><span class="p">)</span></div><div class='line' id='LC34'><br/></div><div class='line' id='LC35'><br/></div><div class='line' id='LC36'>	<span class="c"># USING A WHILE LOOP MAKE THIS SUPER FAST</span></div><div class='line' id='LC37'>	<span class="c"># Details here: </span></div><div class='line' id='LC38'>	<span class="c">#   http://effbot.org/zone/readline-performance.htm</span></div><div class='line' id='LC39'><br/></div><div class='line' id='LC40'>	
<span class="k">while</span> <span class="mi">1</span><span class="p">:</span> </div><div class='line' id='LC41'><br/></div><div class='line' id='LC42'>		<span class="c"># process the first file</span></div><div class='line' id='LC43'>		<span class="n">left_line</span> <span class="o">=</span> <span class="n">left</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC44'>		<span class="k">if</span> <span class="ow">not</span> <span class="n">left_line</span><span class="p">:</span> <span class="k">break</span></div><div class='line' id='LC45'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">left_line</span><span class="p">)</span></div><div class='line' id='LC46'><br/></div><div class='line' id='LC47'>		<span class="n">left_line</span> <span class="o">=</span> <span class="n">left</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC48'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">left_line</span><span class="p">)</span></div><div class='line' id='LC49'><br/></div><div class='line' id='LC50'>		<span class="n">left_line</span> <span class="o">=</span> <span class="n">left</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC51'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">left_line</span><span class="p">)</span></div><div class='line' id='LC52'><br/></div><div class='line' id='LC53'>		<span class="n">left_line</span> <span class="o">=</span> <span class="n">left</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC54'>		<span class="n">fout</span><span class="o">.</span><span 
class="n">write</span><span class="p">(</span><span class="n">left_line</span><span class="p">)</span></div><div class='line' id='LC55'><br/></div><div class='line' id='LC56'><br/></div><div class='line' id='LC57'>		<span class="c"># process the second file</span></div><div class='line' id='LC58'>		<span class="n">right_line</span> <span class="o">=</span> <span class="n">right</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC59'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">right_line</span><span class="p">)</span></div><div class='line' id='LC60'><br/></div><div class='line' id='LC61'>		<span class="n">right_line</span> <span class="o">=</span> <span class="n">right</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC62'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">right_line</span><span class="p">)</span></div><div class='line' id='LC63'><br/></div><div class='line' id='LC64'>		<span class="n">right_line</span> <span class="o">=</span> <span class="n">right</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC65'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">right_line</span><span class="p">)</span></div><div class='line' id='LC66'><br/></div><div class='line' id='LC67'>		<span class="n">right_line</span> <span class="o">=</span> <span class="n">right</span><span class="o">.</span><span class="n">readline</span><span class="p">()</span></div><div class='line' id='LC68'>		<span class="n">fout</span><span class="o">.</span><span class="n">write</span><span class="p">(</span><span class="n">right_line</span><span class="p">)</span></div><div 
class='line' id='LC69'><br/></div><div class='line' id='LC70'>	<span class="n">left</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC71'>	<span class="n">right</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC72'>	<span class="n">fout</span><span class="o">.</span><span class="n">close</span><span class="p">()</span></div><div class='line' id='LC73'>	<span class="k">return</span> <span class="mi">0</span></div><div class='line' id='LC74'><br/></div><div class='line' id='LC75'><span class="k">if</span> <span class="n">__name__</span> <span class="o">==</span> <span class="s">&#39;__main__&#39;</span><span class="p">:</span></div><div class='line' id='LC76'>	<span class="n">args</span> <span class="o">=</span> <span class="n">interface</span><span class="p">()</span></div><div class='line' id='LC77'>	<span class="n">process_reads</span><span class="p">(</span><span class="n">args</span><span class="p">)</span></div></pre></div>
          </div>

          <div class="gist-meta">
            <a href="https://gist.github.com/raw/2232505/606bb6a964272095c26a1f893a9d149608955c4d/interleave_fastq.py" style="float:right;">view raw</a>
            <a href="https://gist.github.com/2232505#file_interleave_fastq.py" style="float:right;margin-right:10px;color:#666">interleave_fastq.py</a>
            <a href="https://gist.github.com/2232505">This Gist</a> brought to you by <a href="http://github.com">GitHub</a>.
          </div>
        </div>
</div>
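<p>The same interleaving can be written more compactly by pulling four lines at a time from each file. The Python 3 sketch below is a variation on the script above rather than a copy of it, and it deliberately differs in two ways that are my assumptions about what is wanted: each file is opened with gzip only if its own name ends in <code>.gz</code>, and an error is raised if the two inputs hold different numbers of records. The function names <code>open_maybe_gzip</code> and <code>interleave</code> are hypothetical.</p>

```python
import gzip
from itertools import islice


def open_maybe_gzip(path, mode):
    """Open `path` in text mode, using gzip when the name ends in '.gz'."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)


def interleave(left_path, right_path, out_path):
    """Write alternating 4-line fastq records; return the number of pairs."""
    pairs = 0
    with open_maybe_gzip(left_path, "r") as left, \
            open_maybe_gzip(right_path, "r") as right, \
            open_maybe_gzip(out_path, "w") as out:
        while True:
            # A fastq record is four lines: header, sequence, '+', qualities.
            left_record = list(islice(left, 4))
            right_record = list(islice(right, 4))
            if not left_record and not right_record:
                break
            if len(left_record) != 4 or len(right_record) != 4:
                raise ValueError("inputs are truncated or differ in length")
            out.writelines(left_record)
            out.writelines(right_record)
            pairs += 1
    return pairs
```

<p>Calling <code>interleave('left.fastq.gz', 'right.fastq.gz', 'interleaved.fastq.gz')</code> would then write the combined file and return how many read pairs were written.</p>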

]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/28/interleave-paired-end-reads/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Bowtie2 output as BAM</title>
		<link>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/</link>
		<comments>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/#comments</comments>
		<pubDate>Wed, 14 Mar 2012 14:01:46 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Software]]></category>
		<category><![CDATA[bowtie2]]></category>
		<category><![CDATA[pipe]]></category>
		<category><![CDATA[samtools]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=442</guid>
		<description><![CDATA[Bowtie2 is a short read aligner that is optimized for aligning longer reads of 50 bp or more. I&#8217;ve been playing around with it and was initially puzzled by the fact that it only outputs SAM-formatted alignments. Then I realized you can pipe the output straight into samtools, which will do the [...]]]></description>
			<content:encoded><![CDATA[<p><a title="bowtie2" href="http://bowtie-bio.sourceforge.net/bowtie2/index.shtml">Bowtie2</a> is a short read aligner that is optimized for aligning longer reads of 50 bp or more. I&#8217;ve been playing around with it and was initially puzzled by the fact that it only outputs <a title="SAM format spec" href="http://samtools.sourceforge.net/samtools.shtml#5" target="_blank">SAM-formatted</a> alignments. Then I realized you can pipe the output straight into <a title="samtools" href="http://samtools.sourceforge.net/samtools.shtml" target="_blank">samtools</a>, which will do the compression to BAM for you.</p>


<div class="wp-geshi-highlight-wrap5"><div class="wp-geshi-highlight-wrap4"><div class="wp-geshi-highlight-wrap3"><div class="wp-geshi-highlight-wrap2"><div class="wp-geshi-highlight-wrap"><div class="wp-geshi-highlight"><div class="sh"><pre class="de1">$ bowtie2 \
-p 4 \
-x /genome/index \
-1 pair1.fastq \
-2 pair2.fastq \
-U unpaired.fastq \
--very-sensitive \
-X 1000 \
-I 200 \
| samtools view -bS - &gt; output.bam</pre></div></div></div></div></div></div></div>
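<p>If you would rather drive the same pipe from a script, here is a minimal Python sketch using only the standard library. The function names (<code>build_commands</code>, <code>align_to_bam</code>) and the file names in the usage line are made up for illustration, and only a subset of the bowtie2 options from the command above is included. It assumes <code>bowtie2</code> and <code>samtools</code> are on your PATH.</p>

```python
import subprocess


def build_commands(index, mate1, mate2, threads=4):
    """Return the argument lists for the bowtie2 and samtools stages.

    Hypothetical helper: it only assembles the commands, it does not run them.
    """
    bowtie2_cmd = [
        "bowtie2",
        "-p", str(threads),  # number of alignment threads
        "-x", index,         # basename of the bowtie2 index
        "-1", mate1,         # first mates of the paired reads
        "-2", mate2,         # second mates of the paired reads
    ]
    # "-bS -" reads SAM from stdin and writes BAM to stdout,
    # the same flags used in the shell command above.
    samtools_cmd = ["samtools", "view", "-bS", "-"]
    return bowtie2_cmd, samtools_cmd


def align_to_bam(index, mate1, mate2, out_bam, threads=4):
    """Pipe bowtie2's SAM output through samtools and write a BAM file."""
    bowtie2_cmd, samtools_cmd = build_commands(index, mate1, mate2, threads)
    with open(out_bam, "wb") as out:
        aligner = subprocess.Popen(bowtie2_cmd, stdout=subprocess.PIPE)
        converter = subprocess.Popen(
            samtools_cmd, stdin=aligner.stdout, stdout=out
        )
        # Close our copy of the pipe so bowtie2 gets SIGPIPE
        # if samtools exits early.
        aligner.stdout.close()
        converter_status = converter.wait()
        aligner_status = aligner.wait()
    if aligner_status != 0 or converter_status != 0:
        raise RuntimeError(
            "pipeline failed: bowtie2={0}, samtools={1}".format(
                aligner_status, converter_status
            )
        )
```

<p>Usage would look something like <code>align_to_bam("/genome/index", "pair1.fastq", "pair2.fastq", "output.bam")</code>. Keeping the command assembly in its own function makes it easy to print or inspect the exact commands before launching a long alignment.</p>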


]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2012/03/14/bowtie2-output-as-bam/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Installing ABySS</title>
		<link>http://www.ngcrawford.com/2010/12/05/installing-abyss/</link>
		<comments>http://www.ngcrawford.com/2010/12/05/installing-abyss/#comments</comments>
		<pubDate>Sun, 05 Dec 2010 19:08:08 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[ABySS]]></category>
		<category><![CDATA[Installation]]></category>
		<category><![CDATA[Instructions]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=344</guid>
		<description><![CDATA[To install ABySS on a system running an older version of gcc, use the following commands. &#62;./configure --enable-maxk=96 --disable-openmp \ CPPFLAGS=-I&#60;path to google-sparsehash install&#62;/include \ --prefix=&#60;ABySS install directory&#62;/ &#62;make AM_CXXFLAGS=-Wall &#62;make install]]></description>
			<content:encoded><![CDATA[<p>To install <a href="http://www.bcgsc.ca/platform/bioinfo/software/abyss">ABySS</a> on a system running an older version of gcc, use the following commands.</p>
<blockquote><p>&gt;./configure --enable-maxk=96 --disable-openmp \</p>
<p>CPPFLAGS=-I&lt;path to <a href="http://code.google.com/p/google-sparsehash/">google-sparsehash</a> install&gt;/include \</p>
<p>--prefix=&lt;ABySS install directory&gt;/</p>
<p>&gt;make AM_CXXFLAGS=-Wall</p>
<p>&gt;make install</p></blockquote>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/12/05/installing-abyss/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Word by word in Terminal</title>
		<link>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/</link>
		<comments>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/#comments</comments>
		<pubDate>Mon, 27 Sep 2010 15:33:42 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=339</guid>
		<description><![CDATA[One of the annoying things in the OS X Terminal is that if you want to traverse a line of text word by word you need to type &#8216;esc-b&#8217; and &#8216;esc-f&#8217;.  This post on macromates, the TextMate blog, explains how to reset these keys. Enjoy.]]></description>
			<content:encoded><![CDATA[<p>One of the annoying things in the OS X Terminal is that if you want to traverse a line of text word by word you need to type &#8216;esc-b&#8217; and &#8216;esc-f&#8217;.  This post on <a href="http://blog.macromates.com/2006/word-movement-in-terminal/">macromates</a>, the TextMate blog, explains how to reset these keys. Enjoy.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/09/27/word-by-word-in-terminal/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>MultiMarkdown</title>
		<link>http://www.ngcrawford.com/2010/06/09/multimarkdown/</link>
		<comments>http://www.ngcrawford.com/2010/06/09/multimarkdown/#comments</comments>
		<pubDate>Wed, 09 Jun 2010 15:56:58 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=336</guid>
		<description><![CDATA[Since I started using github in a serious way back in January I&#8217;ve begun writing my documentation in the markdown format that displays so nicely on github. Markdown is essentially a parsing tool and a simple text syntax that allows the easy conversion of human-readable text to html. It&#8217;s intuitive, it took less than 5 minutes to pick up, [...]]]></description>
			<content:encoded><![CDATA[<p>Since I started using <a href="http://www.github.com">github</a> in a serious way back in January I&#8217;ve begun writing my documentation in the <a href="http://daringfireball.net/projects/markdown/">markdown</a> format that displays so nicely on github. Markdown is essentially a parsing tool and a simple text syntax that allows the easy conversion of human-readable text to html. It&#8217;s intuitive, it took less than 5 minutes to pick up, and it saves me a ton of time not writing HTML. However, its ease of use is tempered a bit by a lack of features. Although it is easy to create headers, lists, and code blocks &#8211; simple HTML stuff &#8211; it doesn&#8217;t include the option to create tables, formatted mathematical formulas, citations, or bibliographies. Since I&#8217;m a scientist who wants to produce documents with these sorts of features, this is annoying.</p>
<p>Luckily, the markdown syntax has recently been extended, in a project called <a href="http://fletcherpenney.net/multimarkdown/">MultiMarkdown</a>, to include many of the aforementioned features. MultiMarkdown essentially merges the markdown syntax with <a href="http://www.latex-project.org/">LaTeX</a> which, if you haven&#8217;t heard of it, is a rather inscrutable but extremely powerful text formatting language. It&#8217;s popular in the CS and physics disciplines. LaTeX produces beautiful documents, but it&#8217;s easy to spend a week or more adjusting the formatting and reading the documentation trying to figure out some of the more complicated features. MultiMarkdown looks like it will do much of the more basic LaTeX formatting, but without the headache.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/06/09/multimarkdown/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Vertebrate Zoology: Bi302</title>
		<link>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/</link>
		<comments>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/#comments</comments>
		<pubDate>Fri, 22 Jan 2010 14:21:11 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Courses]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=329</guid>
		<description><![CDATA[Welcome to Vertebrate Zoology lab (Bi302 Lab).  I&#8217;ll be using this portion of my website to post notes/slides and to answer your questions.  Please make use of the comments. More to come.]]></description>
			<content:encoded><![CDATA[<p>Welcome to Vertebrate Zoology lab (Bi302 Lab).  I&#8217;ll be using this portion of my website to post notes/slides and to answer your questions.  Please make use of the comments. More to come.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2010/01/22/vertebrate-zoology-bi302/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>40 Essential Tools and Resources to Visualize Data &#124; FlowingData</title>
		<link>http://www.ngcrawford.com/2009/12/16/40-essential-tools-and-resources-to-visualize-data-flowingdata/</link>
		<comments>http://www.ngcrawford.com/2009/12/16/40-essential-tools-and-resources-to-visualize-data-flowingdata/#comments</comments>
		<pubDate>Wed, 16 Dec 2009 20:06:14 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>
		<category><![CDATA[Software]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=327</guid>
		<description><![CDATA[This looks incredibly useful.  I really need to sit down and learn Flash and Processing. 40 Essential Tools and Resources to Visualize Data &#124; FlowingData.]]></description>
			<content:encoded><![CDATA[<p>This looks incredibly useful.  I really need to sit down and learn Flash and Processing.</p>
<p><a href="http://flowingdata.com/2008/10/20/40-essential-tools-and-resources-to-visualize-data/">40 Essential Tools and Resources to Visualize Data | FlowingData</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2009/12/16/40-essential-tools-and-resources-to-visualize-data-flowingdata/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
		<item>
		<title>Wallace&#8217;s Insect Collection Found!</title>
		<link>http://www.ngcrawford.com/2009/11/25/wallaces-insect-collection-found/</link>
		<comments>http://www.ngcrawford.com/2009/11/25/wallaces-insect-collection-found/#comments</comments>
		<pubDate>Wed, 25 Nov 2009 13:31:12 +0000</pubDate>
		<dc:creator>Nick</dc:creator>
				<category><![CDATA[Blog]]></category>

		<guid isPermaLink="false">http://www.ngcrawford.com/?p=325</guid>
		<description><![CDATA[Via the New York Times The owner wanted a sum that far exceeded Mr. Heggestad’s budget — a colossal $600. “I was just out of law school, I had no money and no business buying it,” he said. But the owner was willing to take installments of $100 a month, and into Mr. Heggestad’s possession [...]]]></description>
			<content:encoded><![CDATA[<p>Via the New York Times</p>
<blockquote><p>The owner wanted a sum that far exceeded Mr. Heggestad’s budget — a colossal $600. “I was just out of law school, I had no money and no business buying it,” he said. But the owner was willing to take installments of $100 a month, and into Mr. Heggestad’s possession fell an incomparable scientific treasure.</p>
<p>The cabinet belonged to Alfred Russel Wallace, the English naturalist who conceived the idea of evolution through natural selection independently of <a style="color: #004276; text-decoration: underline;" title="More articles about Charles Robert Darwin." href="http://topics.nytimes.com/top/reference/timestopics/people/d/charles_robert_darwin/index.html?inline=nyt-per">Charles Darwin</a>.</p></blockquote>
<p>Wow!</p>
<p><a href="http://www.nytimes.com/2009/11/24/science/24cabi.html">Museum to Display Historic Cabinet That Belonged to Alfred Russel Wallace &#8211; NYTimes.com</a>.</p>
]]></content:encoded>
			<wfw:commentRss>http://www.ngcrawford.com/2009/11/25/wallaces-insect-collection-found/feed/</wfw:commentRss>
		<slash:comments>0</slash:comments>
		</item>
	</channel>
</rss>

