
I am trying to accomplish something similar to what was described in this thread: How to split a huge csv file based on content of first column?

There, the best solution seemed to be to use awk, which does do the job. However, I am dealing with very large CSV files, and I would like to split the file up without creating a new copy, since the disk I/O speed is killing me. Is there a way to split the original file without creating a new copy?
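For context, the awk approach from that thread is roughly the following one-liner (a sketch, not the exact command from that answer; the input name big.csv and the comma delimiter are assumptions):

    # Write each line to a file named after its first field, so rows
    # whose first column is "foo" end up in foo.csv.
    awk -F',' '{ print > ($1 ".csv") }' big.csv

Note that this streams every row back out to disk as new files, which is exactly the I/O cost I am hoping to avoid.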

user788171
  • You just want to have a new file that contains only the first-column data from your monster.txt? – Kent Jun 20 '12 at 13:59
  • No, I want to split the original file into smaller files based on the value of the first column. The first column can be assumed to be sorted. However, I would like to do this with as little I/O as possible, ideally by splitting the huge file in place on disk. – user788171 Jun 20 '12 at 14:51
  • Use a RAM disk, or reduce your I/O with a scheduling utility like *ionice* if your platform provides one. – Todd A. Jacobs Jun 20 '12 at 19:45

2 Answers


I'm not really sure what you're asking, but if your question is: "Can I take a huge file on disk and split it 'in-place' so I get many smaller files without actually having to write those smaller files to disk?", then the answer is no.

You will need to iterate through the first file and write the "segments" back to disk as new files, regardless of whether you use awk, Python or a text editor for this. You do not need to make a copy of the first file beforehand, though.
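If you do use awk for this, a sketch along these lines (the file name big.csv and the comma delimiter are assumptions) streams the input exactly once and, since your comment under the question says the first column is sorted, closes each output file as soon as its key changes, so awk never holds more than one output file open:

    # Stream big.csv once; because column 1 is sorted, each output
    # file can be closed as soon as a new key appears.
    awk -F',' '
        $1 != prev { close(out); out = $1 ".csv"; prev = $1 }
        { print > out }
    ' big.csv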

Tim Pietzcker
  • This is unfortunate; I was really hoping to be able to split it in place without having to rewrite all the data to disk. – user788171 Jun 20 '12 at 14:53

"Splitting a file" still requires RAM and disk I/O. There's no way around that; it's just how the world works.

However, you can certainly reduce the impact of I/O-bound processes on your system. Some obvious solutions are:

  1. Use a RAM disk to reduce disk I/O.
  2. Use a SAN disk to reduce local disk I/O.
  3. Use an I/O scheduler to rate-limit your disk I/O. For example, most Linux systems support the ionice utility for this purpose (see the sketch after this list).
  4. Chunk up the file and use batch queues to reduce CPU load.
  5. Use nice to reduce CPU load during file processing.
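As a concrete sketch of points 1, 3, and 5 (the paths, sizes, and priority levels here are illustrative assumptions, not recommendations):

    # Stage the output on a RAM disk (tmpfs) if it fits in memory,
    # trading disk I/O for RAM usage:
    sudo mkdir -p /mnt/ramdisk
    sudo mount -t tmpfs -o size=8g tmpfs /mnt/ramdisk

    # Run the split at low CPU priority (nice) and low I/O priority
    # (ionice best-effort class, lowest level) so it yields to other work:
    nice -n 19 ionice -c2 -n7 awk -F',' '{ print > ("/mnt/ramdisk/" $1 ".csv") }' big.csv

Keep in mind that ionice only has an effect under I/O schedulers that support priorities (historically CFQ on Linux), so its usefulness depends on your kernel configuration.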

If you're dealing with files, then you're dealing with I/O. It's up to you to make the best of it within your system constraints.

Todd A. Jacobs