
I have a CSV file that's way too large for Excel and I'd like to split it into smaller files. I already found a way to do this here, and it even preserves the header row which is great.

However, I want to add some conditions to this process. For example, there's a "Confirmed" column in the file and possible row values are "0", "1", or "" (empty).

Is it possible to only keep rows where this column value is "1" in the smaller files? I have some other conditions too, but I want to know at least where to start with this.

bugnumber9

3 Answers


I like the command-line utility GoCSV, by aotimme. It follows the Unix philosophy of having a number of small tools, each of which does one thing very well, and the tools can be pipelined. It also has pre-built binaries for Linux and Windows.

I mocked up this sample input based on the info in your question:

_ID__,Confirmed
00001,1
00002,0
00003,0
00004,1
00005,1
...
09996,1
09997,0
09998,0
09999,0
10000,1

GoCSV's filter and split subcommands can be piped together: first filter out every row whose Confirmed value isn't "1", then break the remaining rows into files of 1,000 lines each (999 data rows plus the header):

gocsv filter -c Confirmed -eq 1 input.csv | gocsv split --max-rows 999

The filter subcommand specifies which column to consider (-c Confirmed, the Confirmed column) and the condition to apply (-eq 1), so that only rows with a 1 in the Confirmed column are passed along.

GoCSV always treats the first row as the header (a number of its subcommands only make sense if they interpret the first row as a header), so I subtracted 1 for --max-rows.
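
If you need to apply your other conditions too, one possibility (a sketch, untested) is to chain extra filter stages in the pipeline before splitting; Approved below is a hypothetical second column, reusing only the -c/-eq options shown above:

gocsv filter -c Confirmed -eq 1 input.csv \
    | gocsv filter -c Approved -eq 1 \
    | gocsv split --max-rows 999

Each stage writes CSV (header included) to the next, so the final split still sees a proper header row.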

For my mock input.csv, that yielded 5 output CSVs:

for CSV in out*.csv; do
    echo "--$CSV--"
    gocsv dims "$CSV"
done
--out-1.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-2.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-3.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-4.csv--
Dimensions:
  Rows: 999
  Columns: 2
--out-5.csv--
Dimensions:
  Rows: 979
  Columns: 2

Again, GoCSV doesn't count the header as a row, so the Rows count is only 999 for the complete files.

Zach Young

You could use Miller (available here for several OSs) for this task:

mlr --csv filter '$Confirmed == 1' then split -n 999 --prefix 'file' file.csv

The filter verb has an awk-like syntax where you can use whatever conditions you might need.
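
To combine several conditions, the DSL's boolean operators work as in awk; the Pending column below is just a hypothetical example of a second condition:

mlr --csv filter '$Confirmed == 1 && $Pending != 1' then split -n 999 --prefix 'file' file.csv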

Fravadona

With Python's standard csv module you could iterate over the input CSV's rows, buffer each row where Confirmed == 1, and flush that buffer to numbered output files whenever it reaches 999 rows:

import csv
import sys

BATCH_SIZE = 999
OUTNAME_TMPL = "output_XX.csv"

# Replace XX in the template above with this incrementing idx
out_idx = 1

rows_buf: list[dict[str, str]] = []

f_in = open("input.csv", encoding="utf-8", newline="")
reader = csv.DictReader(f_in)

# .fieldnames can return None, so check below
header = reader.fieldnames
if header is None:
    print("error: expected a header row; got None", file=sys.stderr)
    sys.exit(1)


def flush_rows():
    """Write rows_buf to a CSV numbered with out_idx, and with header."""
    fname = OUTNAME_TMPL.replace("XX", str(out_idx))

    with open(fname, "w", encoding="utf-8", newline="") as f_out:
        writer = csv.DictWriter(f_out, fieldnames=header)
        writer.writeheader()
        writer.writerows(rows_buf)


for row in reader:
    if len(rows_buf) == BATCH_SIZE:
        flush_rows()
        out_idx += 1
        rows_buf = []

    if row["Confirmed"] == "1":
        rows_buf.append(row)

# Flush any remaining rows, then close the input file
if rows_buf:
    flush_rows()
f_in.close()

If you haven't used DictReader/DictWriter before: the reader yields each row as a dict[str, str], where each key is a column name and each value is that row's field for that column. The writer must be initialized with the fieldnames (the global var header in the script); its write methods then take those same dicts (straight from the reader) and output them as properly formatted CSV.
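
Since you mentioned having other conditions, one way to keep the loop readable is to move the test into a small predicate function and extend it there; Status is a made-up column used only for illustration:

def keep(row: dict[str, str]) -> bool:
    """Return True if this row belongs in the output files."""
    # Extend with further conditions as needed; "Status" is a
    # hypothetical column, not from the original question.
    return row["Confirmed"] == "1" and row["Status"] == "active"

The loop body then becomes if keep(row): rows_buf.append(row).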

Zach Young