I've been working with a lot of very large files, and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce-style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.

The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines is then written to an output file, or in this example, just printed to the screen. Normally this would be tricky code to write, but Python 2.7's wonderful multiprocessing module handles all the synchronization for you.
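A minimal sketch of the approach, assuming a grouper helper and a process_chunk worker (matching the names used in the comments below); the input filename, the chunk size of 10, and the per-line work inside process_chunk are placeholders:

import multiprocessing
from itertools import izip_longest

def grouper(n, iterable, fillvalue=None):
    # Collect lines into fixed-size chunks; the last chunk is padded with fillvalue.
    args = [iter(iterable)] * n
    return izip_longest(fillvalue=fillvalue, *args)

def process_chunk(line):
    # Placeholder per-line work; swap in the real processing here.
    if line is None:  # padding from the final, short chunk
        return None
    return line.upper()

if __name__ == '__main__':
    p = multiprocessing.Pool(multiprocessing.cpu_count())
    with open('big_file.txt') as infile:  # placeholder filename
        for chunk in grouper(10, infile):
            results = p.map(process_chunk, chunk)  # each line becomes one worker task
            for r in results:
                if r is not None:
                    print r  # or outfile.write(r) to stream to disk
    p.close()
    p.join()

The if __name__ == '__main__' guard matters because multiprocessing may re-import the module in the worker processes on platforms that spawn rather than fork (e.g. Windows).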

2 Responses to Python: Multiprocessing large files

  1. at says:

    A mistake in the code makes the results get overwritten on each loop.

    Here is the corrected final loop:

    results = []
    for chunk in grouper(10, test_data):
        result = p.map(process_chunk, chunk)
        results.extend(result)
    for r in results:
        print r  # replace with outfile.write()

    • Nick says:

      Results should be overwritten at each loop, but only after you write them to an outfile. If I extended the results list as you suggest, all the results would be stored in memory, which kinda defeats the purpose of this script. Overwriting results isn't a bug, it's a feature! ;)
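For reference, a sketch of the streaming write Nick describes, reusing the grouper, process_chunk, p, and test_data names from the thread above; the output path is a placeholder. Each batch is written out as soon as it is processed, so earlier results never accumulate in memory:

with open('output.txt', 'w') as outfile:  # placeholder output path
    for chunk in grouper(10, test_data):
        results = p.map(process_chunk, chunk)
        for r in results:
            if r is not None:
                outfile.write(r)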
