
I have a large .csv file that is well over 300 GB. I would like to chunk it into smaller files of 100,000,000 rows each (each row is approximately 55-60 bytes).

I wrote the following code:

import pandas as pd

# Read the file lazily in chunks of 100,000,000 rows rather than all at once.
df = pd.read_csv('/path/to/really/big.csv', header=None, chunksize=100000000)
count = 1
for chunk in df:
    name = '/output/to/this/directory/file_%s.csv' % count
    chunk.to_csv(name, header=False, index=False)
    print(count)
    count += 1

This code works fine, and I have plenty of disk space to store the roughly 5.5-6 GB chunks one at a time, but it's slow.

Is there a better way?

EDIT

I have written the following iterative solution:

import csv

with open('/path/to/really/big.csv', 'r') as csvfile:
    read_rows = csv.reader(csvfile)
    file_count = 1
    row_count = 0
    f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    for row in read_rows:
        # csv.reader splits each line into fields, so join them back up.
        f.write(','.join(row) + '\n')
        row_count += 1
        # Roll over to a new output file every 100,000,000 rows.
        if row_count % 100000000 == 0:
            f.close()
            file_count += 1
            f = open('/output/to/this/directory/file_%s.csv' % file_count, 'w')
    f.close()

EDIT 2

I would like to call attention to Vor's comment about using the Unix/Linux split command; it is the fastest solution I have found.

invoker
  • No need to reinvent the wheel, use split (http://linux.die.net/man/1/split). If you definitely want it in Python, wrap it in the `subprocess` module. – Vor Sep 23 '15 at 15:27
  • I will definitely give this a try as well! – invoker Sep 23 '15 at 15:33
  • Why even use the csv module? You're just reading and writing lines. The input and output format of the lines remain the same. Also, `for row in read_rows` can be `for row_count, row in enumerate(read_rows, start=1)` and then you don't need to track the row number. – Steven Rumbalski Sep 23 '15 at 17:17

2 Answers


There is an existing tool for this on Unix/Linux:

split -l 100000 -d source destination

will add a two-digit numeric suffix (from -d) to the destination prefix for each chunk; adjust -l to the number of lines you want per output file (100,000,000 in your case).
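
If you would rather call this from Python (as Vor's comment suggests), a minimal sketch using the standard subprocess module could look like the following; the input path and output prefix are just the placeholders from the question, and it assumes GNU split is on your PATH:

import subprocess

# Writes file_00, file_01, ... into the output directory, 100,000,000 lines each.
subprocess.run(
    ['split', '-l', '100000000', '-d',
     '/path/to/really/big.csv',
     '/output/to/this/directory/file_'],
    check=True,
)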

karakfa

You don't really need to read all that data into a pandas DataFrame just to split the file - in fact, you don't need to read it all into memory at all. You could seek to the approximate offset you want to split at, scan forward until you find a line break, and then loop, copying much smaller chunks from the source file into a destination file between your start and end offsets. (This approach assumes your CSV doesn't have any column values with embedded newlines.)

SMALL_CHUNK = 100000

def write_chunk(source_file, start, end, dest_name):
    # Copy the byte range [start, end) from source_file (opened in binary
    # mode, 'rb') into dest_name, SMALL_CHUNK bytes at a time so that only
    # a small buffer is ever held in memory.
    source_file.seek(start)
    with open(dest_name, 'wb') as dest_file:
        for chunk_start in range(start, end, SMALL_CHUNK):
            chunk_end = min(chunk_start + SMALL_CHUNK, end)
            dest_file.write(source_file.read(chunk_end - chunk_start))
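
To show how that function fits the description above, here is a rough, untested sketch of a driver loop (the chunk-size estimate, the next_line_start helper, and the file names are my own illustrations, not part of the original answer): open the file in binary mode, jump ahead by roughly the target chunk size, scan to the next newline, and copy each byte range with write_chunk.

TARGET_CHUNK = 100000000 * 60   # ~100,000,000 rows at roughly 60 bytes per row (estimate)

def next_line_start(source_file, approx_pos, file_size):
    # Move to approx_pos and skip forward to the byte just after the next newline.
    if approx_pos >= file_size:
        return file_size
    source_file.seek(approx_pos)
    source_file.readline()          # discard the partial line
    return source_file.tell()

with open('/path/to/really/big.csv', 'rb') as source_file:
    source_file.seek(0, 2)          # seek to the end to find the file size
    file_size = source_file.tell()
    start = 0
    count = 1
    while start < file_size:
        end = next_line_start(source_file, start + TARGET_CHUNK, file_size)
        write_chunk(source_file, start, end,
                    '/output/to/this/directory/file_%s.csv' % count)
        start = end
        count += 1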

Actually, an intermediate solution could be to use the csv module - that would still parse all of the lines in the file, which isn't strictly necessary, but would avoid reading huge arrays into memory for each chunk.
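
For what it's worth, a minimal sketch of that intermediate csv-based version (paths and row count taken from the question; unbenchmarked) could look like this:

import csv

ROWS_PER_FILE = 100000000

with open('/path/to/really/big.csv', newline='') as source:
    reader = csv.reader(source)
    dest = None
    writer = None
    for row_number, row in enumerate(reader):
        if row_number % ROWS_PER_FILE == 0:
            # Start a new output file for each block of ROWS_PER_FILE rows.
            if dest is not None:
                dest.close()
            file_number = row_number // ROWS_PER_FILE + 1
            dest = open('/output/to/this/directory/file_%s.csv' % file_number,
                        'w', newline='')
            writer = csv.writer(dest)
        writer.writerow(row)
    if dest is not None:
        dest.close()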

babbageclunk
  • I guess Pandas handles this natively? I am not familiar with Pandas at all. So would you agree that the solution I posted about using generators would be invalid for this scenario? – idjaw Sep 23 '15 at 15:20
  • Your answer relies on having all the lines in memory, which isn't feasible for the big file invoker is dealing with. – babbageclunk Sep 23 '15 at 15:22
  • OK. Looks like I learned something here as well. :) Thanks. I'll remove my solution. – idjaw Sep 23 '15 at 15:25
  • I like this, do you know how it performs relative to reading into memory or just iterating? I just wrote another snippet of code that I'm running now that iterates, I will put it in the edit above. – invoker Sep 23 '15 at 15:32
  • I haven't benchmarked but I'd expect it to be much faster - it doesn't need to parse each line into lists, just scan through for the next line boundary at each big jump (which won't take long since each row is only ~60 bytes). – babbageclunk Sep 23 '15 at 16:12