
I have CSV files with multiple columns that are sorted. For instance, I might have lines like this:

19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

I would like to divide up the file based on the 3rd column, e.g. put PLXS and PCP entries into their own files called PLXS.csv and PCP.csv. Because the file happens to be pre-sorted, all of the PLXS entries are before the PCP entries and so on.

I generally end up doing things like this in C++ since that's the language I know best, but in this case my input CSV file is several gigabytes and too large to load into memory.

Can somebody show how this can be accomplished? Perl/Python/PHP/bash solutions are all okay; they just need to be able to handle the huge file without excessive memory usage.

user788171
  • have you browsed around at all? several related questions on this site in all above languages and more. you can search: `site:stackoverflow.com csv split by value` or some such variant. best of luck – mechanical_meat Mar 30 '12 at 23:02

5 Answers


Here's an old-school one-liner for you (just replace the >> with > to truncate the output files on each run):

awk -F, '{print >> ($3".csv")}' input.csv

Due to popular demand (and an itch I just had), I've also written a version that will duplicate the header lines to all files:

awk -F, 'NR==1 {h=$0; next} {f=$3".csv"} !($3 in p) {p[$3]; print h > f} {print >> f}' input.csv

Or you could pre-create the files with just their headers using this, and then finish with the first awk command:

HDR=$(head -1 input.csv); for fn in $(tail -n+2 input.csv | cut -f3 -d, | sort -u); do echo "$HDR" > "$fn.csv"; done

Most modern systems have the awk binary included, but if you don't have it, you can find an exe at Gawk for Windows.

Sean Summers
  • this is awesome :) it'd be even better if we could retain the headers – Yurui Zhang May 20 '15 at 16:55
  • There weren't headers in the original. Perhaps you can ask a different question? – Sean Summers May 21 '15 at 02:35
  • This helped me - but to improve - the headers example can be simplified by using `NR==1 {hdr=$0; next}` then you don't need to check NR on the subsequent patterns as you cannot get to these patterns AND not be NR>1 due to ordering rule of Awk. From man page: `Each pattern in the program then shall be evaluated in the order of occurrence, and the action associated with each pattern that matches the current record executed.` – Phil Aug 30 '18 at 12:22
  • Also a minor improvement - the filename pattern `{fn=$3".csv"}` should occur after the NR==1 pattern, as it is only used by patterns following this. – Phil Aug 30 '18 at 12:30
  • Thank you for the feedback, @Phil. I don't believe a profiler would show any change or improvement after your suggestions and I believe obfuscates the explicit intentions supported by the standard `awk` language constructs (which you helpfully included and I feel are also very clear). 1. No logic changes by adding `next`. Removing explicit awk directives that intentionally define the purpose of a code block would be an anti-pattern from the quote you provided. 2. Reordering the patterns you refer to in the program changes no code execution plan, so feel free to edit your copy! – Sean Summers Aug 31 '18 at 13:47
  • Hi Sean, profiling this using --profile shows that your version tests the pattern and executes it - creating the filename n+1 times, where mine calls n times. Note that as there is no pattern to test in my version, so each "test" it should be quicker. Also the result of using `next` is that subsequent patterns are tested n times, whereas without `next` n+1 tests occur. Obviously the number of executions of the pattern body remain the same, however as I've simplified each pattern, each test requires less work in the case of the p array, and no test is required for writing the row out. – Phil Aug 31 '18 at 15:47
  • Of course I accept that some may feel the loss in clarity of relying on the ordering rules of patterns, is not worth these fairly modest speed-ups - it would depend on the size of the input data and complexity of the per-line processing. I personally find it less cluttered to remove the pattern tests when they are not needed - but I'm arguing that from a textbook point-of-view my script is slightly, and demonstrably by profiling, more efficient. – Phil Aug 31 '18 at 15:51
  • I appreciate your pedantry and perseverance, @Phil. I am also impressed with what you consider 'fairly modest speed-ups'. I look forward to what you can do with implementing the actual accepted answer here, in C++! – Sean Summers Sep 02 '18 at 00:31
  • Lol - touche... Well that would definitely be a different question! But my pedantry does stem from thinking about how the underlying C would work. A well written awk would surely not test a lack of pattern predicate? – Phil Sep 02 '18 at 00:49
perl -F, -ane '`echo $_ >> $F[2].csv`' < file

These command-line options are used:

  • -n loop around every line of the input file
  • -a autosplit mode – split input lines into the @F array. Defaults to splitting on whitespace.
  • -e execute the perl code
  • -F autosplit modifier, in this case splits on ,

@F is the array of fields in each line, indexed starting at $F[0].


If you want to retain the header, then a more complicated approach is required.

perl splitintofiles.pl file

Contents of splitintofiles.pl:

use strict;
use warnings;

my %c;        # lines buffered per output file, keyed by filename
my $header;

open my $fh, '<', $ARGV[0] or die "cannot open $ARGV[0]: $!";
while (my $line = <$fh>) {
    print $line;
    if ($. == 1) {
        $header = $line;
    } else {
        # $fields[2] is the 3rd column
        my @fields = split /,/, $line;
        # accumulate the line in %c under its output filename
        $c{"$fields[2].csv"} .= $line;
    }
}
close $fh;

for my $file (keys %c) {
    print "$file\n";
    open my $out, '>', $file or die "cannot open $file: $!";
    print $out $header;
    print $out $c{$file};
    close $out;
}

input:

a,b,c,d,e,f,g,h,i,j,k,l
19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

output PCP.csv

a,b,c,d,e,f,g,h,i,j,k,l
20111222,,PCP,63830,N,A,,,164.07001,164.09000,164.12000,1
20111223,,PCP,63830,N,A,,,164.53000,164.53000,164.55000,1
20111227,,PCP,63830,N,A,,,165.69000,165.61000,165.64000,1

output PLXS.csv

a,b,c,d,e,f,g,h,i,j,k,l
19980102,,PLXS,10032,Q,A,,,15.12500,15.00000,15.12500,2
19980105,,PLXS,10032,Q,A,,,14.93750,14.75000,14.93750,2
19980106,,PLXS,10032,Q,A,,,14.56250,14.56250,14.87500,2
Chris Koknat

C++ is fine if you know it best. Why would you try to load the entire file into memory anyway?

Since the output depends on the column being read, you can keep a buffer per output file and write each record to the appropriate file as you process, flushing as you go to keep the memory footprint small.

I do this (albeit in Java) when I need to take massive extracts from a database. The records are pushed into a buffered file stream and anything in memory is cleaned up, so the program's footprint never grows beyond what it starts out at.

Fly-by-the-seat-of-my-pants pseudo-code:

  1. Create a list to hold your output file buffers
  2. Open stream on file and begin reading in the contents one line at a time
  3. Did we encounter a record that has an open file stream for its content type yet?
    • Yes -
      • Get the stored file stream
      • store the record into that file
      • flush the stream
    • No -
      • create a stream and save it to our list of streams
      • store the record on the stream
      • flush the stream
  4. Rinse repeat...

Basically, continue this processing until we reach the end of the file.

Since we never store more than pointers to the streams, and we flush as soon as we write to them, we never hold anything resident in the application's memory other than the current record from the input file. Thus the footprint is kept manageable.
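
For what it's worth, here is a minimal C++ sketch of that idea (not the original poster's code): it assumes the fields contain no quoted commas, reads one line at a time, and keeps one std::ofstream open per distinct symbol, so only the current record is ever held in memory. The input filename is just a placeholder; pass the real file as the first argument.

#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <unordered_map>

int main(int argc, char* argv[]) {
    // "input.csv" is a placeholder default; pass the real file as the first argument.
    const std::string inputName = (argc > 1) ? argv[1] : "input.csv";
    std::ifstream in(inputName);
    if (!in) {
        std::cerr << "cannot open " << inputName << "\n";
        return 1;
    }

    // One open output stream per distinct value of the 3rd column.
    std::unordered_map<std::string, std::ofstream> outputs;

    std::string line;
    while (std::getline(in, line)) {
        // Pull out the 3rd comma-separated field (no quoted commas assumed).
        std::istringstream fields(line);
        std::string field;
        int n = 0;
        while (n < 3 && std::getline(fields, field, ',')) ++n;
        if (n < 3) continue;            // skip lines with fewer than 3 columns
        const std::string key = field;  // e.g. "PLXS" or "PCP"

        // Open <key>.csv the first time we see this key, then reuse the handle.
        auto it = outputs.find(key);
        if (it == outputs.end()) {
            it = outputs.emplace(key, std::ofstream(key + ".csv")).first;
        }

        // Only the current line is held in memory; ofstream does its own buffering.
        it->second << line << '\n';
    }
    return 0;  // stream destructors flush and close every output file
}

One caveat: if the number of distinct symbols is very large, you may hit the operating system's open-file limit, in which case you would need to close and reopen streams in append mode, closer to the flush-as-you-go pseudo-code above.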

Mike McMahon

If the first three columns of your file don't have quoted commas, a simple one-liner is:

cat file | perl -e 'while(<>){@a=split(/,/,$_,4);$key=$a[2];open($f{$key},">$key.csv") unless $f{$key};print {$f{$key}} $_;} for $key (keys %f) {close $f{$key}}'

It doesn't consume much memory (only the associations distinct(3rd_column) --> file-handle are stored) and the rows can come in any order.

If the columns are more complex (contain quoted commas for example) then use Text::CSV.

Pierre D

If there is no header line in the input file:

awk -F, '
{fn = $3".csv"
 print > fn}' bigfile.csv

If there is a header line that should be passed on to the split files:

awk -F, '
NR==1 {hdr=$0; next}
{fn = $3".csv"}
!seen[$3]++{print hdr > fn}
{print > fn}' bigfile.csv
Kamaraju Kusumanchi