Here's a solution that works whether the IDs are sorted or unsorted. The only overhead in the unsorted case is opening the destination (per-ID) CSV files multiple times:
import csv

with open("test.csv", newline="") as in_file:
    reader = csv.reader(in_file)
    prev_id = None
    out_file = None
    writer = None
    for row in reader:
        this_id = row[0]
        if this_id != prev_id:
            # New group: close the previous output file and (re)open this ID's file.
            if out_file is not None:
                out_file.close()
            fname = f"file_{this_id}.csv"
            print(f"opening {fname} for appending...")
            out_file = open(fname, "a", newline="")
            writer = csv.writer(out_file)
            prev_id = this_id
        writer.writerow(row)
    if out_file is not None:
        out_file.close()
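One thing to watch: because the output files are opened in append mode, running the script a second time appends to whatever the first run produced. A minimal, optional cleanup sketch, assuming all the output files match the file_*.csv pattern used above:

import glob
import os

# Remove output files left over from a previous run so append mode starts clean.
for stale in glob.glob("file_*.csv"):
    os.remove(stale)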
Here's the test input, but now with IDs 1 and 2 interleaved:
1, a1, 0.1
2, b1, 0.1
1, a1, 0.2
2, b1, 0.2
1, a1, 0.4
2, b1, 0.4
1, a1, 0.3
2, b1, 0.3
1, a1, 0.0
2, b1, 0.0
1, a1, 0.9
2, b1, 0.9
When I run it I see:
./main.py
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
opening file_1.csv for appending...
opening file_2.csv for appending...
and my output files look like this. file_1.csv:
1, a1, 0.1
1, a1, 0.2
1, a1, 0.4
1, a1, 0.3
1, a1, 0.0
1, a1, 0.9
and file_2.csv:
2, b1, 0.1
2, b1, 0.2
2, b1, 0.4
2, b1, 0.3
2, b1, 0.0
2, b1, 0.9
I also created a fake BIG file, 289MB, with 100 ID groups (250,000 rows per ID), and my solution ran in about 12 seconds. For comparison, the accepted answer that uses groupby() ran in about 10 seconds on the big CSV; the high-rated awk script ran in about a minute.
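For reference, the accepted answer's code isn't reproduced here, but a minimal sketch of a groupby()-based splitter might look like the following. It assumes each ID's rows are contiguous in the input, since itertools.groupby() only groups adjacent rows with the same key, which is why the approach above handles the interleaved case more naturally.

import csv
from itertools import groupby
from operator import itemgetter

# Sketch only: split test.csv by its first column, assuming rows for each ID
# are contiguous. groupby() starts a new group every time the key changes, so
# interleaved IDs would reopen (and here overwrite) the same output file.
with open("test.csv", newline="") as in_file:
    reader = csv.reader(in_file)
    for group_id, rows in groupby(reader, key=itemgetter(0)):
        with open(f"file_{group_id}.csv", "w", newline="") as out_file:
            csv.writer(out_file).writerows(rows)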