25

In a basic script I have the following process:

import csv
reader = csv.reader(open('huge_file.csv', 'rb'))

for line in reader:
    process_line(line)

See this related question. I want to process the lines in batches of 100 rows, to implement batch sharding.

The problem with implementing the related answer is that the csv reader object is not subscriptable and does not support len():

>>> import csv
>>> reader = csv.reader(open('dataimport/tests/financial_sample.csv', 'rb'))
>>> len(reader)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: object of type '_csv.reader' has no len()
>>> reader[10:]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable
>>> reader[10]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: '_csv.reader' object is unsubscriptable

How can I solve this?

smci
Mario César

3 Answers

30

Just make your reader subscriptable by wrapping it into a list. Obviously this will break on really large files (see alternatives in the Updates below):

>>> reader = csv.reader(open('big.csv', 'rb'))
>>> lines = list(reader)
>>> print lines[:100]
...

Further reading: How do you split a list into evenly sized chunks in Python?
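
For reference, the chunking idiom from that linked question looks roughly like this once the reader has been materialized into the `lines` list above (the `chunks` helper and `process_batch` are illustrative names, not from the question):

def chunks(seq, size=100):
    # yield successive `size`-sized slices of a list
    for start in range(0, len(seq), size):
        yield seq[start:start + size]

for batch in chunks(lines, 100):
    process_batch(batch)  # hypothetical per-batch handler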


Update 1 (list version): Another possibility is to just process each chunk as it arrives, while iterating over the lines:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

chunk, chunksize = [], 100

def process_chunk(chunk):
    print len(chunk)
    # do something useful ...

for i, line in enumerate(reader):
    if (i % chunksize == 0 and i > 0):
        process_chunk(chunk)
        del chunk[:]  # or: chunk = []
    chunk.append(line)

# process the remainder
process_chunk(chunk)

Update 2 (generator version): I haven't benchmarked it, but maybe you can increase performance by using a chunk generator:

#!/usr/bin/env python

import csv
reader = csv.reader(open('4956984.csv', 'rb'))

def gen_chunks(reader, chunksize=100):
    """ 
    Chunk generator. Take a CSV `reader` and yield
    `chunksize` sized slices. 
    """
    chunk = []
    for i, line in enumerate(reader):
        if (i % chunksize == 0 and i > 0):
            yield chunk
            del chunk[:]  # or: chunk = []
        chunk.append(line)
    yield chunk

for chunk in gen_chunks(reader):
    print chunk # process chunk

# test gen_chunks on some dummy sequence:
for chunk in gen_chunks(range(10), chunksize=3):
    print chunk # process chunk

# => yields
# [0, 1, 2]
# [3, 4, 5]
# [6, 7, 8]
# [9]

There is a minor gotcha, as @totalhack points out:

Be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration.
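
If you need each yielded chunk to be an independent list (for example because you collect the chunks for later use), one possible variation on the generator above is to rebind instead of clearing in place (a sketch, not benchmarked):

def gen_chunks(reader, chunksize=100):
    """Yield independent lists of up to `chunksize` rows each."""
    chunk = []
    for line in reader:
        chunk.append(line)
        if len(chunk) == chunksize:
            yield chunk
            chunk = []  # rebind, so the list already yielded is never mutated again
    if chunk:
        yield chunk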

miku
  • The problem is that making the file subscriptable forces reading all the lines of the file. This is a really huge file and memory usage rises too much if I do that. – Mario César Feb 10 '11 at 12:28
  • @Mario: Added a generator version, which might be faster (but I didn't have time to test it - maybe you do). – miku Feb 10 '11 at 12:48
  • Is the second argument to enumerate() correct? I get "takes exactly 1 argument (2 given)". – Mario César Feb 10 '11 at 12:52
  • For gen_chunks(range(10), chunksize=2) I get [[],[1],[3],[5],[7],[9]]; might something be wrong? – Mario César Feb 10 '11 at 12:55
  • You'll need 2.6 or higher for the `start` parameter (http://docs.python.org/library/functions.html#enumerate). – miku Feb 10 '11 at 12:56
  • @Mario: Ok, corrected my answer, should work now, even on earlier python versions (got rid of the `start` parameter on `enumerate`). – miku Feb 10 '11 at 13:04
  • Thanks, however I am getting this: [0, 1, 2] [4, 5] [7, 8]; could it also be related to Python 2.5? – Mario César Feb 10 '11 at 13:09
  • @Mario: See my current answer, it should be correct and work with python 2.2 onwards. – miku Feb 10 '11 at 13:11
  • @Mario: Wah, that's irritating. Here is another gist (https://gist.github.com/820490), just tried it myself with python 2.5. If that doesn't solve it, I'm out of options (and time ;) for this answer. Good luck! – miku Feb 10 '11 at 13:17
  • @TheMYYN :-) Thanks anyway, it's a great solution, it just needs more work. I will keep testing and post it complete :) – Mario César Feb 10 '11 at 13:18
  • Good answer but I found it twice as fast to split the file in linux using the split command, and then read in each chunk – radtek Aug 09 '17 at 21:47
  • **Minor gotcha**: be aware that this yields the same object over and over with different contents. This works fine if you plan on doing everything you need to with the chunk between each iteration. That's likely the intent in most cases, but if that behavior doesn't work for your situation, one option is to change `del chunk[:]` to `chunk = []`. – totalhack Oct 16 '19 at 16:00
  • @totalhack I used the above code logic, but some rows get duplicated, even though there is no duplicate in the file, how can I avoid this issue? – Kar Sep 29 '20 at 17:30
  • @totalhack I used chunk =[] – Kar Sep 29 '20 at 17:36
7

We can use the pandas module to handle these big CSV files:

import pandas as pd

# read the file lazily in 1000-row chunks, then combine into one DataFrame
temp = pd.read_csv('BIG_File.csv', iterator=True, chunksize=1000)
df = pd.concat(temp, ignore_index=True)
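
Note that pd.concat still materializes the entire file as one DataFrame in memory. If the point is to keep memory bounded, a sketch like the following processes each chunk as it is read instead (process_chunk is a placeholder for your own logic):

import pandas as pd

for chunk in pd.read_csv('BIG_File.csv', chunksize=1000):
    process_chunk(chunk)  # each `chunk` is a DataFrame of up to 1000 rows
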
Debashis Sahoo
2

There isn't a good way to do this for all .csv files. You should be able to divide the file into chunks using file.seek to skip a section of the file. Then you have to scan one byte at a time to find the end of the row. Then you can process the two chunks independently. Something like the following (untested) code should get you started.

import csv

file_one = open('foo.csv')
file_two = open('foo.csv')
file_two.seek(0, 2)     # seek to the end of the file
sz = file_two.tell()    # fetch the offset
file_two.seek(sz // 2)  # seek back to the middle
ch = ''
while ch != '\n':       # scan forward to the end of the current row
    ch = file_two.read(1)
# file_two is now positioned at the start of a record
segment_one = csv.reader(file_one)
segment_two = csv.reader(file_two)

I'm not sure how you can tell that you have finished traversing segment_one. If you have a column in the CSV that is a row id, then you can stop processing segment_one when you encounter the row id from the first row in segment_two.
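
One possible way (untested, like the code above) to tell when segment_one has reached the start of segment_two is to remember the byte offset where the newline scan ended (file_two.tell() at that point) and feed csv.reader a generator that stops reading lines at that offset; csv.reader accepts any iterable of strings. This would replace the last two lines of the snippet above:

boundary = file_two.tell()  # captured right after the newline scan

def lines_until(f, stop):
    # yield raw lines from `f` until its file position reaches `stop`
    while f.tell() < stop:
        line = f.readline()
        if not line:
            break
        yield line

segment_one = csv.reader(lines_until(file_one, boundary))
segment_two = csv.reader(file_two)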

D.Shawley