Processing a huge amount of files in python

Question

I have a huge number of report files (about 650 files) which takes about 320 M of hard disk and I want to process them. There are a lot of entries in each file; I should count and log them based on their content. Some of them are related to each other and I should find, log and count them too; matches may be in different files. I have wrote a simple script to do the job. I used python profiler and it just took about 0.3 seconds to run the script for one single file with 2000 lines that we need half of them for processing. But for the whole directory it took 1 hour and a half to be done. This is how my script looks like:

# imports

class Parser(object):
    def __init__(self):
        # load some configurations
        # open some log files
        # set some initial values for some variables

    def parse_packet(self, tags):
        # extract some values from line

    def found_matched(self, packet):
        # search in the related list to find matched line

    def save_packet(self, packet):
        # write the line in the appropriate files and increase or decrease some counters

    def parse(self, file_addr):
        lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
        for line in lines:
            packet = parse_packet(line)
            if found_matched(packet):
                # count
            self.save_packet(packet)

    def process_files(self):
        if not os.path.isdir(self.src_dir):
            self.log('No such file or directory: ' + str(self.src_dir))
            sys.exit(1)
        input_dirs = os.walk(self.src_dir)
        for dname in input_dirs:
            file_list = dname[2]
            for fname in file_list:
                self.parse(os.path.join(dname[0], fname))
        self.finalize_process()

    def finalize_process(self):
        # closing files

I want to decrease the time at least to the 10% percent of current execution time. Maybe multiprocessing can help me or just some enhancement in current script will do the task. Anyway could you please help me in this?

Edit 1:

I have changed my code according to @Reut Sharabani's answer:

def parse(self, file_addr):
    lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
    for line in lines:
        packet = parse_packet(line)
        if found_matched(packet):
            # count
        self.save_packet(packet)

def process_files(self):
    if not os.path.isdir(self.src_dir):
        self.log('No such file or directory: ' + str(self.src_dir))
        sys.exit(1)
    input_dirs = os.walk(self.src_dir)
    for dname in input_dirs:
        process_pool = multiprocessing.Pool(10)
        for fname in file_list:
            file_list = [os.path.join(dname[0], fname) for fname in dname[2]]
            process_pool.map(self.parse, file_list)
    self.finalize_process()

I also added below lines before my class definition to avoid PicklingError: Can't pickle <type 'instancemethod'>: attribute lookup __builtin__.instancemethod failed:

import copy_reg
import types

def _pickle_method(m):
    if m.im_self is None:
        return getattr, (m.im_class, m.im_func.func_name)
    else:
        return getattr, (m.im_self, m.im_func.func_name)

copy_reg.pickle(types.MethodType, _pickle_method)

Another thing that I have done into my code was not to keep open log files during file processing; I open and close them for writing each entry just to avoid ValueError: I/O operation on closed file.

Now the problem is that I have some files which are being processed multiple times. I also got wrong counts for my packets. What did I do wrong? Should I put process_pool = multiprocessing.Pool(10) before the for loop? Consider that I have just one directory right now and it doesn't seem to be the problem.

EDIT 2:

I also tried using ThreadPoolExecutor this way:

with ThreadPoolExecutor(max_workers=10) as executor:
    for fname in file_list:
        executor.submit(self.parse, fname)

Results were correct, but it took an hour and a half to be completed.

Take a look at celery - it is a task queue. You can have a process queuing a task for each file and any numbers of "worker" processes doing actual work. But if the task is I/O bound, multitasking will help only to the point where you saturate the I/O system. — Paulo Scardine, Jan 22 '15 at 12:06
This discussion of a Python implementation for Tim Bray's Wide Finder 2 challenge might also be useful -- http://effbot.org/zone/wide-finder.htm — michel-slm, Jan 22 '15 at 12:15
Use a thread pool. This is probably a lot more I\O blocked than cpu blocked. — Reut Sharabani, Jan 22 '15 at 12:15

Dr. Jan-Philip Gehrcke · Answer 1 · 2015-01-22T12:53:55.863

First of all, "about 650 files which takes about 320 M" is not a lot. Given that modern hard disks easily read and write 100 MB/s, the I/O performance of your system probably is not your bottleneck (also supported by "it just took about 0.3 seconds to run the script for one single file with 2000 lines", which clearly indicates CPU-limitation). However, the exact way you are reading files from within Python may not be efficient.

Furthermore, a simple multiprocessing-based architecture, run on a common multi core system, will allow you to perform your analysis much faster (no need to involve celery here, no need to cross machine boundaries).

multiprocessing architecture

Just have a look at multiprocessing, your architecture likely will involve one manager process (the parent), which defines a task Queue, and a Pool of worker processes. The manager (or feeder) puts tasks (e.g. file names) into the queue, and the workers consume these. After finishing with a task, a worker lets the manager know, and proceeds consuming the next one.

file processing method

This is quite inefficient:

    lines = [l for index, l in enumerate(open(file_addr, 'r').readlines()) if index % 2 != 0]
    for line in lines:
        ...

readlines() reads the entire file before the list comprehension is evaluated. Only after that you again iterate through all lines. Hence, you iterate three times through your data. Combine everything into a single loop, so that you iterate the lines only once.

score 1 · Answer 2 · edited May 23 '17 at 12:20

You should be using threads here. If you're blocked by cpu later, you can use processes.

To explain I first created a ten thousand files (0.txt ... 9999.txt), with a line count that's equivalent to the name (+1), using this command:

for i in `seq 0 999`; do for j in `seq 0 $i`; do echo $i >> $i.txt; done ; done

Next, I've created a python script using a ThreadPool with 10 threads to count the lines of all files that have an even value:

#!/usr/bin/env python

from multiprocessing.pool import ThreadPool
import time
import sys

print "creating %s threads" % sys.argv[1]
thread_pool = ThreadPool(int(sys.argv[1]))

files = ["%d.txt" % i for i in range(1000)]

def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f.readlines():
            if int(line.strip()) % 2 == 0:
                line_count += 1

        print "finished file %s" % filename
        return line_count

start = time.time()
print sum(thread_pool.map(count_even_value_lines, files))
total = time.time() - start
print total

As you can see this takes no time, and the results are correct. 10 files are processed in parallel and the cpu is fast enough to handle the results. If you want even more you may consider using threads and processes to utilize all cpus as well as not letting IO block you.

Edit:

As comments suggest, I was wrong and this is not I/O blocked, so you can speed it up using multiprocessing (cpu blocked). Because I used a ThreadPool which has the same interface as Pool you can make minimal edits and have the same code running:

#!/usr/bin/env python

import multiprocessing
import time
import sys


files = ["%d.txt" % i for i in range(2000)]
# function has to be defined before pool is opened and workers are forked
def count_even_value_lines(filename):
    with open(filename, 'r') as f:
        # do some processing
        line_count = 0
        for line in f:
            if int(line.strip()) % 2 == 0:
                line_count += 1

        return line_count


print "creating %s processes" % sys.argv[1]
process_pool = multiprocessing.Pool(int(sys.argv[1]))

start = time.time()
print sum(process_pool.map(count_even_value_lines, files))
total = time.time() - start
print total

Results:

me@EliteBook-8470p:~/Desktop/tp$ python tp.py 1
creating 1 processes
25000000
21.2642059326
me@EliteBook-8470p:~/Desktop/tp$ python tp.py 10
creating 10 processes
25000000
12.4360249043

Although `multiprocessing.ThreadPool` may be nice, I have a bad feeling to throw an *undocumented* feature at such a simple problem. SO is not only about providing solutions, but also about education. — Dr. Jan-Philip Gehrcke, Jan 22 '15 at 12:29
@Jan-PhilipGehrcke thanks, I actually tested and you're right. I edited the answer to include the fix. — Reut Sharabani, Jan 22 '15 at 12:42
If I was right, you may want to upvote my answer, where I explain *why* this problem is not I/O-limited :-). — Dr. Jan-Philip Gehrcke, Jan 22 '15 at 12:53
About your first comment, I feel like ThreadPool can be learned just like multiprocessing.Pool. This is a task where you have need workers, and pools are just for that, so that's the right way to go. If you **don't** use ThreadPool you'd end up implementing something similar and that's a bit backwards. — Reut Sharabani, Jan 22 '15 at 12:56
I also find it quite elegant. Still, `ThreadPool` from multiprocessing is not a canonical approach to take, it is not documented: https://docs.python.org/2/library/multiprocessing.html. That's all I wanted to say: throwing an undocumented part of the stdlib at a simple problem like this is weird, from a pedagogical point of view. — Dr. Jan-Philip Gehrcke, Jan 22 '15 at 13:00

PeterE · Answer 3 · 2015-01-22T14:07:59.167

Aside from using parallel processing, your parse method is rather inefficient as @Jan-PhilipGehrcke already pointed out. To expand on his recommendation: The classical variant:

def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        line_no = 0
        for line in f:
            line_no += 1
            if line_no % 2 != 0:
                packet = parse_packet(line)
                if found_matched(packet):
                    # count
                self.save_packet(packet)

Or using your style (assuming you use python 3):

def parse(self, file_addr):
    with open(file_addr, 'r') as f:
        filtered = (l for index,l in enumerate(f) if index % 2 != 0)
        for line in filtered:
            # and so on

The thing to notice here, is the use of iterators, all operations to build the filtered list (which is not actually a list!!) operate on and return iterators, which means that at no point the entire file is loaded into a list.

Processing a huge amount of files in python

3 Answers3

multiprocessing architecture

file processing method

Edit: