
This is my second day working in Python. I worked on this in C++ for a while, but decided to try Python. My program works as expected. However, when I process one file at a time without the glob loop, it takes about half an hour per file. When I include the glob, the loop takes about 12 hours to process 8 files.

My question is this: is there anything in my program that is definitely slowing it down? Is there anything I should be doing to make it faster?

I have a folder of large files, for example:

file1.txt (6 GB), file2.txt (5.5 GB), file3.txt (6 GB)

If it helps, each line of data begins with a character that tells me how the rest of the characters are formatted, which is why I have all of the if/elif statements. A line of data would look like this: `T35201 M352 RZNGA AC`

I am trying to read each file, do some parsing using splits, and then save the output.

The computer has 32 GB of RAM, so my method is to read each file into RAM, loop through it, and then save the output, freeing the RAM for the next file.

I've included the file so you can see the methods I am using. I use an if/elif chain with about 10 different elif branches. I have tried a dictionary, but I couldn't figure that out to save my life.

Any answers would be helpful.

import csv
import glob

for filename in glob.glob("/media/3tb/5may/*.txt"):
    f = open(filename, 'r')
    c = csv.writer(open(filename + '.csv', 'wb'))

    second = 0
    mill = 0
    for line in f.readlines():
        #print line
        event = 0
        ticker = 0
        marketCategory = 0
        variable = line[0:1]

        if variable is 'T':
            second = line[1:6]
            mill = 0
        else:
            second = second

        if variable is 'R':
            ticker = line[1:7]
            marketCategory = line[7:8]
        elif variable is ...
        elif variable is ...
        elif ...
        elif ...
        elif ...
        elif ...
        elif ...

        if variable (!= 'T') and (!= 'M')
            c.writerow([second, mill, event ....])
    f.close()

UPDATE: Each of the elif statements is nearly identical. The only parts that change are the ways the lines are split. Here are two elif statements (there are 13 total, and they are almost all identical except for the way they split the line):

elif variable is 'C':
    order = line[1:10]
    Shares = line[10:16]
    match = line[16:25]
    printable = line[25:26]
    price = line[26:36]
elif variable is 'P':
    ticker = line[17:23]
    order = line[1:10]
    buy = line[10:11]
    shares = line[11:17]
    price = line[23:33]
    match = line[33:42]

UPDATE 2: I have run the code using `for line in f` two different times. The first time I ran a single file without `for filename in glob.glob("/media/3tb/file.txt"):`, hard-coding the file path for one file, and it took about 30 minutes.

I ran it again with `for filename in glob.glob("/media/3tb/*file.txt")` and it took an hour just for one file in the folder. Does the glob code add that much time?

BrianR
  • One thing would be to change `for line in f.readlines():` to `for line in f:`. This way you don't read the whole file into memory at once, but rather one line at a time. Also `variable = line[0:1]` is the same as `variable = line[0]`, though it doesn't really affect the speed. –  Feb 22 '13 at 14:05
  • use `line[0] == 'T'` instead of `variable is 'T'`. The latter might fail (`is` tests object identity and there could be more than one `'T'` object). – jfs Feb 22 '13 at 14:09
  • what version of python are you using? You could also set a buffer size and load the file in chunks: `buffersize = 50000000`, `buffer = infile.read(buffersize)`, `while len(buffer): stuff here` – Drewdin Feb 22 '13 at 14:22
  • John Zwinck and Markus are right to eschew `readlines`. Also, really consider moving those 20 gigabytes from csv files to a database of some kind. – Colonel Panic Feb 22 '13 at 14:46
  • Isn't `if variable (!= 'T') and (!= 'M')` a syntax error? Shouldn't it be `if variable != 'T' and variable != 'M':` or `if variable not in ('T', 'M'):`? Also shouldn't the handle given to `csv.writer` be closed manually later (or used in a `with`)? I am new as well :) – William Feb 22 '13 at 15:10
  • You've paraphrased your code, quite apart from the bits you've omitted `if variable (!= 'T') and (!= 'M')` isn't even valid syntax. Most likely something in the body of the loop is taking a while but you haven't given us that code. – Duncan Feb 22 '13 at 15:15
  • Are you sure you're testing the same two files? Since you're specifying the file, and glob is choosing whatever the first one it has is, there could be a large difference in file size. My first thought would be to fire up the debugger and check what glob is returning. – Jason White Feb 22 '13 at 18:00

4 Answers


Here:

for line in f.readlines():

You should just do this:

for line in f:

The former reads the entire file into a list of lines, then iterates over that list. The latter does it incrementally, which should drastically reduce the total memory allocated and later freed by your program.
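
For example, a minimal sketch of the question's loop restructured this way (assuming Python 2, as the `'wb'` mode in the question suggests; the `with` blocks close both files automatically, and the parsing body is elided):

    import csv
    import glob

    for filename in glob.glob("/media/3tb/5may/*.txt"):
        with open(filename, 'r') as f, open(filename + '.csv', 'wb') as out:
            c = csv.writer(out)
            for line in f:          # iterates lazily, one line at a time
                variable = line[0]  # first character selects the record format
                # ... parse the fixed-width fields and c.writerow(...) as before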

John Zwinck
  • I've tried doing this for a single file, and the speed is about the same, 30 minutes. With Python, is there no I/O bottleneck if I don't read it all into memory? I will loop through a couple files and will update on the time using your method. Thank you again. – BrianR Feb 22 '13 at 14:43
  • You can check if your program consumes 100% CPU or anywhere near it--if not, it's probably limited by I/O. If you are reading the same input files over and over (in subsequent runs of the program), you should consider writing a translator which reads the CSV and writes a NumPy "ndarray," then subsequent runs can just load the array and operate on that, which should be somewhat faster. In the end, however, you need to decide what performance criteria you have, and if you need it to be really fast, you need to choose a different language for the "hot" part of the code at least. – John Zwinck Feb 23 '13 at 01:01

Whenever you ask "what part of this is slowing down the whole thing?" the answer is "profile it." There's an excellent description of how to do this in Python's documentation at The Python Profilers. Also, as John Zwinck points out, you're loading too much into memory at once and should be only loading one line at a time (file objects are "iterable" in Python).
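
For instance, a quick way to do this with the standard-library cProfile module (here `process_file` is a hypothetical function wrapping the question's parsing loop):

    import cProfile
    import pstats

    # profile one file's worth of work, then print the ten most expensive calls
    cProfile.run('process_file("/media/3tb/5may/file1.txt")', 'parse.prof')
    pstats.Stats('parse.prof').sort_stats('cumulative').print_stats(10)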

Personally, I prefer what Perl calls a "dispatch table" to a huge if...elif...elif monstrosity. This webpage describes a Pythonic way of doing it: a dictionary mapping keys to functions. It doesn't work in all cases, but for a simple if x==2:...elif x==3... (i.e., switching on the value of one variable) it works great.
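
A minimal sketch of that idea applied to the question's format codes (the field offsets follow the 'C' and 'P' branches from the update; the handler names are made up for illustration, and `f` and `c` are the file and csv writer from the question):

    def parse_C(line):
        # order, shares, match, printable, price
        return [line[1:10], line[10:16], line[16:25], line[25:26], line[26:36]]

    def parse_P(line):
        # ticker, order, buy, shares, price, match
        return [line[17:23], line[1:10], line[10:11], line[11:17], line[23:33], line[33:42]]

    handlers = {'C': parse_C, 'P': parse_P}  # one entry per record type

    for line in f:
        handler = handlers.get(line[0])
        if handler:
            c.writerow(handler(line))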

Hut8

Use a generator (via yield) to 'buffer' more lines into memory than just one line at a time, but NOT the whole file at a time.

def readManyLines(fObj, num=1000):
    # readlines(num) returns roughly num bytes' worth of complete lines;
    # keep calling it until the file is exhausted
    lines = fObj.readlines(num)
    while lines:
        for line in lines:
            yield line
        lines = fObj.readlines(num)

f = open(filename, 'r')
for line in readManyLines(f):
    process(line)
g19fanatic
  • Here, `num` is a size hint in bytes. `readlines(num)` will try to read that many bytes and then read more to finish the current line, so it always returns complete lines rather than stopping exactly at `num` bytes. – g19fanatic Feb 22 '13 at 18:21

Not sure if this helps at all, but try using this instead of glob.glob, just to rule that out as the problem. I'm on Windows so I can't be 100% certain this works under Unix, but I don't see why it wouldn't.

import re
import os
import csv

TXT_RE = re.compile(r'txt$', re.I)  # compile the pattern once, not per file

def find_text_files(root):
    """Find .txt files under a given directory"""
    foundFiles = []
    for dirpath, dirnames, filenames in os.walk(root):
        for file in filenames:
            if TXT_RE.search(file):
                foundFiles.append(os.path.join(dirpath, file))
    return foundFiles

txtfiles = find_text_files(r'd:\files')  # raw string: '\f' would otherwise be a form feed

for filename in txtfiles:
    f = open(filename, 'r')
    c = csv.writer(open(filename + '.csv', 'wb'))
Jason White