
I am quite new to Python and programming in general, but I am trying to run a "sliding window" calculation over a tab-delimited .txt file that contains about 7 million lines. What I mean by sliding window is that it will run a calculation over, say, 50,000 lines, report the number, then move up say 10,000 lines and perform the same calculation over another 50,000 lines. I have the calculation and the "sliding window" working correctly, and it runs well if I test it on a small subset of my data. However, if I try to run the program over my entire data set it is incredibly slow (I've had it running for about 40 hours now). The math is quite simple, so I don't think it should be taking this long.
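For concreteness, here is a quick sketch of the windows that pattern produces (pure arithmetic, no file I/O; the numbers come straight from the description above):

window, step, total = 50000, 10000, 7000000
starts = range(0, total - window + 1, step)
print(len(starts))                            # 696 windows in all
print([(s, s + window) for s in starts[:3]])  # [(0, 50000), (10000, 60000), (20000, 70000)]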

The way I am reading my .txt file right now is with csv.DictReader from the csv module. My code is as follows:

import csv

file1 = '/Users/Shared/SmallSetbee.txt'
newfile = open(file1, 'rb')
reader = csv.DictReader((line.replace('\0', '') for line in newfile), delimiter="\t")

I believe this is making a dictionary out of all 7 million lines at once, which I'm thinking could be the reason it slows down so much for the larger file.

Since I am only interested in running my calculation over "chunks" or "windows" of the data at a time, is there a more efficient way to read in only the specified lines, perform the calculation, and then repeat with a new "chunk" or "window" of lines?

flz416
  • This does not make a dictionary of all lines at once. It makes a dictionary for each line. This means that the snippet you posted is not the cause of your performance woes. Perhaps you could show us some more code? – Steven Rumbalski Nov 15 '12 at 16:15
  • I suspect that if you're doing calculations over large sets of table-like data you might want to look at Pandas: http://pandas.pydata.org/pandas-docs/dev/io.html#iterating-through-files-chunk-by-chunk Everything you're trying to do has probably already been done before 1000 times better. – Iguananaut Nov 15 '12 at 16:27 (a rough sketch of this approach follows these comments)
  • You will run this calculation on 696 "windows". How long does it take for a single window on a 50k line file? – Steven Rumbalski Nov 15 '12 at 16:33
  • Do you always move the "window" up by a fixed number of lines, or by a multiple of a fixed number? Also, what version of Python are you using? – martineau Nov 15 '12 at 16:39
  • Profile your code and see exactly where it's spending most of its time. – martineau Nov 15 '12 at 16:45
  • see [`sliding_window(iterable, size, step, fillvalue)`](http://stackoverflow.com/a/13408251/4279) – jfs Nov 15 '12 at 23:48
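For what it's worth, Iguananaut's pandas suggestion could look roughly like the sketch below. The file path mirrors the question and compute is a hypothetical per-window calculation; chunksize makes read_csv return the file in pieces instead of all at once:

import pandas as pd

# Read in 10,000-row pieces (the step size); a 50,000-row window is
# then the concatenation of the five most recent pieces.
recent = []
for chunk in pd.read_csv("/Users/Shared/SmallSetbee.txt", sep="\t", chunksize=10000):
    recent.append(chunk)
    if len(recent) > 5:
        recent.pop(0)            # drop the oldest piece
    if len(recent) == 5:
        window = pd.concat(recent)
        # compute(window)        # hypothetical calculation over the window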

2 Answers


A collections.deque is an ordered collection of items which can take a maximum size. When you add an item to one end, one falls off the other end. This means that to iterate over a "window" on your csv, you just need to keep appending rows to the deque and it will handle discarding the old ones for you.

import collections
import csv

dq = collections.deque(maxlen=50000)
with open(...) as csv_file:
    reader = csv.DictReader((line.replace("\0", "") for line in csv_file), delimiter="\t")

    # initial fill: read the first 50,000-row window
    for _ in range(50000):
        dq.append(next(reader))

    # repeated compute: slide the window forward 10,000 rows at a time
    try:
        while True:
            compute(dq)
            for _ in range(10000):
                dq.append(next(reader))
    except StopIteration:
        compute(dq)
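If the maxlen behavior is unfamiliar, here it is in isolation:

import collections

d = collections.deque(maxlen=3)
for i in range(5):
    d.append(i)
print(d)   # deque([2, 3, 4], maxlen=3) -- the oldest items fell off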
Katriel
  • `try/except` should be closer to `next(reader)` to avoid accidentally catching `StopIteration` from `compute(dq)` – jfs Nov 15 '12 at 19:41
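That restructuring might look like the following sketch (same hypothetical compute and sizes as above):

# Guard only the reads, so a StopIteration raised inside compute()
# is not silently swallowed.
while True:
    compute(dq)
    appended = 0
    try:
        for _ in range(10000):
            dq.append(next(reader))
            appended += 1
    except StopIteration:
        if appended:
            compute(dq)   # final, partially advanced window
        break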

Don't use csv.DictReader; use csv.reader instead. It takes longer to create a dictionary for each row than to create a list for each row. Additionally, it is marginally faster to access a list by index than to access a dictionary by key.
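For example, with csv.reader you look values up by position rather than by header name ("score" below is a made-up column name):

import csv

with open("/Users/Shared/SmallSetbee.txt") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)           # assumes the first row holds column names
    score = header.index("score")   # find the column's position once
    for row in reader:
        value = row[score]          # list indexing instead of row["score"]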

I timed iteration over a 300,000-line, 4-column csv file using the two csv readers. csv.DictReader took seven times longer than csv.reader.

Combine this with katrielalex's suggestion to use collections.deque and you should see a nice speedup.
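Concretely, the only change needed in the loop above is how the reader is constructed:

reader = csv.reader((line.replace("\0", "") for line in csv_file), delimiter="\t")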

Additionally, profile your code to pinpoint where you are spending most of your time.
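The standard library makes that straightforward; a minimal sketch, assuming your script's entry point is a function called main():

import cProfile
import pstats

cProfile.run("main()", "profile.out")            # main() is hypothetical
stats = pstats.Stats("profile.out")
stats.sort_stats("cumulative").print_stats(20)   # top 20 by cumulative time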

Steven Rumbalski