Filtering contigs/chromosomes from a multi-fasta file

A colleague needed to remove some individual fasta records from a multi-fasta file. Googling didn’t reveal a canned way to do it, so I hacked up this script. 8.29.12 – As Jason Gallant pointed out, if your fasta is very small you don’t need to index your fasta file. Just use the simple biopython code he mentions in […]
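For readers who only want the gist, here is a minimal sketch of the filtering idea. It is not the script from the post: it uses only the standard library, does no indexing, and the names `filter_fasta` and `drop_ids` are made up for illustration. It assumes the record ID is the first whitespace-delimited token on each header line.

```python
import io

def filter_fasta(in_handle, out_handle, drop_ids):
    """Copy FASTA records to out_handle, skipping any whose ID is in drop_ids."""
    keep = True
    for line in in_handle:
        if line.startswith(">"):
            # Assumption: the ID is the first token after '>'.
            fields = line[1:].split()
            record_id = fields[0] if fields else ""
            keep = record_id not in drop_ids
        if keep:
            out_handle.write(line)

# Small in-memory demonstration; real use would pass open file handles.
demo = io.StringIO(">chr1 desc\nACGT\nACGT\n>contig7\nGGGG\n>chr2\nTTTT\n")
result = io.StringIO()
filter_fasta(demo, result, {"contig7"})
filtered_text = result.getvalue()
```

Because it streams line by line, this never holds more than one line in memory, which is usually enough when you just want to drop a handful of contigs.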

Python: Multiprocessing large files

I’ve been working with a lot of very large files and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce style algorithm seemed like the way to go, but I had a hard time finding a useful example. […]
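As a rough illustration of the MapReduce-style idea (not the code from the post), the sketch below splits a file into byte ranges that end on line boundaries, counts newline-terminated lines in each range in a worker process, and sums the results. The function names, chunk count, and the line-counting task itself are all assumptions chosen to keep the example small.

```python
import os
import tempfile
from multiprocessing import Pool

def chunk_ranges(path, n_chunks):
    """Split a file into byte ranges that each end on a line boundary."""
    size = os.path.getsize(path)
    step = max(1, size // n_chunks)
    ranges, start = [], 0
    with open(path, "rb") as handle:
        while start < size:
            handle.seek(min(start + step, size))
            handle.readline()  # advance to the end of the current line
            end = handle.tell()
            ranges.append((path, start, end))
            start = end
    return ranges

def count_lines(task):
    """Map step: count newline characters inside one byte range."""
    path, start, end = task
    with open(path, "rb") as handle:
        handle.seek(start)
        return handle.read(end - start).count(b"\n")

def total_lines(path, n_chunks=4, processes=2):
    """Reduce step: sum the per-chunk counts computed by the worker pool."""
    with Pool(processes) as pool:
        return sum(pool.map(count_lines, chunk_ranges(path, n_chunks)))

if __name__ == "__main__":
    # Throwaway demo file; replace with a real path in practice.
    with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
        tmp.write(b"read\n" * 1000)
    print(total_lines(tmp.name))
    os.remove(tmp.name)
```

Each worker reads its whole range into memory, so for very large files you would want many more chunks than processes. Swapping `count_lines` for a real parsing function is where the actual work would go.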

Python: Interleave Paired-End Reads

Here’s a simple script for interleaving paired-end fastq files. You’ll need to do this if you want to create input files for velvet. Unlike velvet’s shuffleSequences_fastq.pl perl script, this script handles gzipped input and output. It requires python 2.7.
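To show what interleaving means in practice, here is a small sketch. It is not the script described above: it is written for Python 3 rather than 2.7, the helper names are hypothetical, and it assumes every FASTQ record is exactly four lines and that gzipped files end in `.gz`.

```python
import gzip
import itertools

def open_maybe_gzip(path, mode):
    """Open path as text, transparently handling a .gz extension."""
    if path.endswith(".gz"):
        return gzip.open(path, mode + "t")
    return open(path, mode)

def interleave(left_path, right_path, out_path):
    """Write records from two FASTQ files alternately: left, right, left, ..."""
    with open_maybe_gzip(left_path, "r") as left, \
         open_maybe_gzip(right_path, "r") as right, \
         open_maybe_gzip(out_path, "w") as out:
        while True:
            # Assumption: one record is exactly four lines.
            left_rec = list(itertools.islice(left, 4))
            right_rec = list(itertools.islice(right, 4))
            if not left_rec and not right_rec:
                break
            if len(left_rec) != 4 or len(right_rec) != 4:
                raise ValueError("input files differ in length or are truncated")
            out.writelines(left_rec)
            out.writelines(right_rec)
```

A call such as `interleave("reads_1.fastq.gz", "reads_2.fastq.gz", "interleaved.fastq.gz")` (file names are placeholders) would produce one file with each forward read followed by its mate.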

Blue Collar Bioinformatics

Just wanted to recommend Blue Collar Bioinformatics, a slick blog with lots of useful bioinformatics scripts. Everything is written in Python and the full working source is typically available in Git.

How to emulate Blast’s “Short Sequence Parameters”

I just spent an hour figuring out how to emulate Blast’s “Short Sequence Search Parameters” in BioPython 1.48. To use PAM30 as your matrix you must use existence and extension parameters (i.e. gap costs) of 9 and 1. Here’s what I’ve currently got: result_handle = NCBIWWW.qblast("blastp", "nr", seq_record.seq.tostring(), matrix_name='BLOSUM62', word_size='2', expect='30000', gapcosts='9 […]