Python: performance issues with islice

Question

With the following code, I'm seeing longer and longer execution times as I increase the starting row in islice. For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s. Why does this happen and is there a faster way to do this? I want to be able to iterate over several ranges of rows in a large CSV file (several GB) and make some calculations.

import csv
import itertools
from collections import deque
import time

my_queue = deque()

start_row = 500004
stop_row = start_row + 50000

with open('test.csv', 'rb') as fin:
    #load into csv's reader
    csv_f = csv.reader(fin)

    #start logging time for performance
    start = time.time()

    for row in itertools.islice(csv_f, start_row, stop_row):
        my_queue.append(float(row[4])*float(row[10]))

    #stop logging time
    end = time.time()
    #display performance
    print "Initial queue populating time: %.2f" % (end-start)

Related: http://stackoverflow.com/questions/620367/how-to-jump-to-a-particular-line-in-a-huge-text-file (although, uh, the accepted answer will fail miserably if you try to use it here) — NightShadeQueen, Jul 04 '15 at 23:16
So don't use the accepted answer, use something based on [Rosenfield's](http://stackoverflow.com/a/620492/355230). — martineau, Jul 05 '15 at 20:02

score 2 · Answer 1 · answered Jul 04 '15 at 23:26

For example, a start_row of 4 will execute in 1s but a start_row of 500004 will take 11s

That is islice being intelligent. Or lazy, depending on which term you prefer.

Thing is, files are "just" strings of bytes on your hard drive. They don't have any internal organization. \n is just another set of bytes in that long, long string. There is no way to access any particular line without looking at all of the information before it (unless your lines are of the exact same length, in which case you can use file.seek).

Line 4? Finding line 4 is fast, your computer just needs to find 3 \n. Line 50004? Your computer has to read through the file until it finds 500003 \n. No way around it, and if someone tells you otherwise, they either have some other sort of quantum computer or their computer is reading through the file just like every other computer in the world, just behind their back.

As for what you can do about it: Try to be smart when trying to grab lines to iterate over. Smart, and lazy. Arrange your requests so you're only iterating through the file once, and close the file as soon as you've pulled the data you need. (islice does all of this, by the way.)

In python

lines_I_want = [(start1, stop1), (start2, stop2),...]
with f as open(filename):
     for i,j in enumerate(f):
          if i >= lines_I_want[0][0]:
              if i >= lines_I_want[0][1]:
                   lines_I_want.pop(0)
                   if not lines_I_want: #list is empty
                         break
              else:
                   #j is a line I want. Do something

And if you have any control over making that file, make every line the same length so you can seek. Or use a database.

score 0 · Answer 2 · edited May 23 '17 at 11:58

The problem with using islice() for what you're doing is that iterates through all the lines before the first one you want before returning anything. Obviously the larger the starting row, the longer this will take. Another is that you're using a csv.reader to read these lines, which incurs likely unnecessary overhead since one line of the csv file is often one row of it. The only time that's not true is when the csv file has string fields in it that contain embedded newline characters — which in my experience is uncommon.

If this is a valid assumption for your data, it would likely be much faster to first index the file and build a table of (filename, offset, number-of-rows) tuples indicating the approximately equally-sized logical chunks of lines/rows in the file. With that, you can process them relatively quickly by first seeking to the starting offset and then reading the specified number of csv rows from that point on.

Another advantage to this approach is it would allow you to process the chunks in parallel, which I suspect is is the real problem you're trying to solve based on a previous question of yours. So, even though you haven't mentioned multiprocessing here, this following has been written to be compatible with doing that, if that's the case.

import csv
from itertools import islice
import os
import sys

def open_binary_mode(filename, mode='r'):
    """ Open a file proper way (depends on Python verion). """
    kwargs = (dict(mode=mode+'b') if sys.version_info[0] == 2 else
              dict(mode=mode, newline=''))
    return open(filename, **kwargs)

def split(infilename, num_chunks):
    infile_size = os.path.getsize(infilename)
    chunk_size = infile_size // num_chunks
    offset = 0
    num_rows = 0
    bytes_read = 0
    chunks = []
    with open_binary_mode(infilename, 'r') as infile:
        for _ in range(num_chunks):
            while bytes_read < chunk_size:
                try:
                    bytes_read += len(next(infile))
                    num_rows += 1
                except StopIteration:  # end of infile
                    break
            chunks.append((infilename, offset, num_rows))
            offset += bytes_read
            num_rows = 0
            bytes_read = 0
    return chunks

chunks = split('sample_simple.csv', num_chunks=4)
for filename, offset, rows in chunks:
    print('processing: {} rows starting at offset {}'.format(rows, offset))
    with open_binary_mode(filename, 'r') as fin:
        fin.seek(offset)
        for row in islice(csv.reader(fin), rows):
            print(row)

Hi martineau, I have tried your code with a 4.5GB csv file and it took about 42 seconds in the split function. I've also used the 'index every line' method that NightShadeQueen posted in a link above. It too takes 42 seconds. It seems that the total time is dictated by the fact that we must read every line in the file, and not which offsets we store. — Colin, Jul 09 '15 at 03:39
Indexing every line would require a lot more memory, even if doesn't take longer. At best if would allow you could divide the file up into chunks of about the same number lines regardless of their length, which is different than into chunks of lines that together represent about the same number of bytes. — martineau, Jul 09 '15 at 07:10

Python: performance issues with islice

2 Answers2

Linked