Here's a solution that works whether the IDs are sorted or unsorted. The only overhead in the unsorted case is opening the destination (per-ID) CSV files multiple times:
import csv

with open("test.csv", newline="") as in_file:
    reader = csv.reader(in_file)
    prev_id = None
    out_file = None
    writer = None
    for row in reader:
        this_id = row[0]
        if this_id != prev_id:
            # New group: close the previous output file and (re)open this ID's file.
            if out_file is not None:
                out_file.close()
            fname = f"file_{this_id}.csv"
            print(f"opening {fname} for appending...")
            out_file = open(fname, "a", newline="")
            writer = csv.writer(out_file)
            prev_id = this_id
        writer.writerow(row)
    if out_file is not None:
        out_file.close()
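One thing to watch: because the output files are opened in append mode, running the script a second time appends to whatever the first run produced. A minimal, optional cleanup sketch, assuming all the output files match the file_*.csv pattern used above:

import glob
import os

# Remove output files left over from a previous run so append mode starts clean.
for stale in glob.glob("file_*.csv"):
    os.remove(stale)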
Here's the test input, but now with IDs 1 and 2 interleaved:
1, a1, 0.1
2, b1, 0.1
1, a1, 0.2
2, b1, 0.2
1, a1, 0.4
2, b1, 0.4
1, a1, 0.3
2, b1, 0.3
1, a1, 0.0
2, b1, 0.0
1, a1, 0.9
2, b1, 0.9
When I run it I see:
./main.py
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
and my output files look like this. file_1.csv:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and file_2.csv:
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
I also created a fake BIG file, 289MB, with 100 ID groups (250,000 rows per ID), and my solution ran in about 12 seconds. For comparison, the accepted answer that uses groupby() ran in about 10 seconds on the big CSV; the high-rated awk script ran in about a minute.
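For reference, the accepted answer's code isn't reproduced here, but a minimal sketch of a groupby()-based splitter might look like the following. It assumes each ID's rows are contiguous in the input, since itertools.groupby() only groups adjacent rows with the same key, which is why the approach above handles the interleaved case more naturally.

import csv
from itertools import groupby
from operator import itemgetter

# Sketch only: split test.csv by its first column, assuming rows for each ID
# are contiguous. groupby() starts a new group every time the key changes, so
# interleaved IDs would reopen (and here overwrite) the same output file.
with open("test.csv", newline="") as in_file:
    reader = csv.reader(in_file)
    for group_id, rows in groupby(reader, key=itemgetter(0)):
        with open(f"file_{group_id}.csv", "w", newline="") as out_file:
            csv.writer(out_file).writerows(rows)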