Python: Multiprocessing large files

I've been working with a lot of very large files, and it has become increasingly obvious that using a single processor core is a major bottleneck to getting my data processed in a timely fashion. A MapReduce-style algorithm seemed like the way to go, but I had a hard time finding a useful example. After a bit of hacking about I came up with the following code.

The basic algorithmic idea is to first read in a large chunk of lines from the file. These are then partitioned out to the available cores and processed independently. The new set of lines is then written to an output file, or in this example just printed to the screen. Normally this would be tricky code to write, but Python 2.7's wonderful multiprocessing module handles all the synchronization for you.

4 thoughts on “Python: Multiprocessing large files”

  1. A mistake in the code makes the results get overwritten on each loop iteration.

    Here is the final loop corrected:

    results = []
    for chunk in grouper(10, test_data):
        # process_chunk stands in for the post's worker function
        results.extend(, chunk))
    for r in results:
        print r  # replace with outfile.write()

    1. Results should be overwritten at each loop, but only after you write them to an outfile. If I extended the results list as you suggest, all the results would be stored in memory, which kind of defeats the purpose of this script. Overwriting results isn't a bug, it's a feature! 😉

  2. Thanks for the nice example!

    Would really like to have a similar one on sharing memory across processes too, if you have the time 🙂
