
I have a huge CSV file of about 20 GB, with 5,000 columns and 2,500,000 rows. I want to write each column into its own file. I already tried a for loop, but it is pretty slow. My code is below:

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)  # number of columns in the header
mkdir cols
for i in $(seq 1 "$Columns"); do
    echo "$i"
    # each iteration re-reads the whole 20 GB file for a single column
    tail -n +2 train.csv | cut -d',' -f"$i" > "cols/col_$i.txt"
done

I would appreciate any suggestions to accelerate this.

Mojtaba
  • How many columns are there? – choroba Sep 03 '16 at 22:12
  • ~5,000 columns and 2,500,000 rows. – Mojtaba Sep 03 '16 at 22:17
  • You can't load it into a DB as a work table and process from there? Hmm, 5,000 cols, probably not. I think the OS will limit how fast this can run based on the number of open files allowed at once. Writing each of 5,000 columns means 5,000 files, so I'd be surprised if any OS (without a custom build) will support that. Look at the output of `ulimit -a | grep file`. What do you see for `nofile`? That is the number of open files for one process, so you'll probably have to work around it and close file[1] when you open `file[n+1]`. Good luck. – shellter Sep 03 '16 at 22:23
  • What do you want to do with an embedded comma, like `1,"b,b","ccc"`? – Walter A Sep 04 '16 at 09:03
  • Another approach: first swap columns and rows as `http://stackoverflow.com/a/1729980/3220113` shows, and then use awk to print each line to another file. – Walter A Sep 04 '16 at 09:13
  • `ulimit -a | grep file` on my Debian won't output `nofile`, but if I `grep "open files" /proc//limits` I get 65535. This is stable. (See the sketch just after these comments for checking this.) – James Brown Sep 04 '16 at 20:04
  • @JamesBrown: Nice, I need to update my environment! I can clearly recall having 20 open files to work with on SunOS (back in the day ;-) ). Good luck to all! – shellter Sep 04 '16 at 23:03
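
Following up on the limits discussed in the comments, a minimal sketch of inspecting the per-process open-file limit on Linux with bash (the `/proc/self/limits` path is Linux-specific):

$ ulimit -n                            # soft limit on open file descriptors
$ ulimit -Hn                           # hard limit
$ grep "open files" /proc/self/limits  # both limits as the kernel sees them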

3 Answers


Here's a bash script that does that in a single pass:

Columns=$(head -n 1 train.csv | sed "s/,/\n/g" | wc -l)
mkdir cols
tail -n +2 train.csv | \
    while IFS=, read -ra row; do
        for i in $(seq 1 "$Columns"); do
            echo "${row[$((i-1))]}" >> "cols/col_$i.txt"
        done
    done

The disadvantage of this script is that it opens and closes the column files millions of times. The following Perl script avoids that issue by keeping all the files open:

#!/usr/bin/perl
use strict;
use warnings;

mkdir 'cols' unless -d 'cols';  # make sure the output directory exists

my @handles;
open my $fh, '<', 'train.csv' or die $!;
<$fh>;  # skip the header
while (<$fh>) {
    chomp;
    my @values = split /,/;
    for (my $i = 0; $i < @values; $i++) {
        # open each column file on first use and keep it open afterwards
        if (!defined $handles[$i]) {
            open $handles[$i], '>', 'cols/col_' . ($i + 1) . '.txt' or die $!;
        }
        print {$handles[$i]} "$values[$i]\n";
    }
}
close $fh;
close $_ for @handles;

Since you have 5,000 columns and this script keeps 5,001 files open (5,000 column files plus the input), you would need to increase the number of open file descriptors that your system allows you to have.
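
For example, in bash you could raise the soft limit before running the script, assuming the hard limit (`ulimit -Hn`) permits it; the script file name below is hypothetical:

$ ulimit -n 5100          # a little more than the 5,001 descriptors needed
$ perl split_columns.pl   # the Perl script above, saved under any name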

redneb

In awk:

$ awk -F',' '{for(i=1;i<=NF;i++) print $i > i}' train.csv
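
This writes each field of every line, header included, to a file named after the column number in the current directory. A possible variation closer to the question's layout (a sketch, not tested at this scale), skipping the header and writing cols/col_N.txt; note that GNU awk transparently closes and re-opens output files when it runs out of descriptors, while other awks may abort with "too many open files":

$ mkdir cols
$ awk -F',' 'NR > 1 {for(i=1;i<=NF;i++) print $i > ("cols/col_" i ".txt")}' train.csv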

A test version that produces 5000 files:

$ cat > foo
1
2
3
$ awk 'BEGIN {for(i=1;i<=5000;i++) a=a i (i<5000? OFS:"")} {$0=a; for(i=1;i<=NF; i++) print $i > i}' foo
$ ls -l | wc -l
5002  # files 1-5000, plus foo, plus the "total 20004" line from ls -l
$ cat 5000
5000
5000
5000

For 250 rows, the timing on my laptop was:

real    1m4.691s
user    1m4.456s
sys     0m0.180s
James Brown

A Perl solution. It opens 1,000 files at a time, so with 5,000 columns it will make 5 passes over your input. Run it with the input filename as a parameter.

#!/usr/bin/perl
use warnings;
use strict;

my $inputfile = shift;
open my $input, '<', $inputfile or die $!;

mkdir 'cols';

my @headers = split /,/, <$input>;
chomp $headers[-1];
my $pos = tell $input;  # Remember where the first data line starts.

my $step = 1000;
for (my $from = 0; $from <= $#headers; $from += $step) {
    my $to = $from + $step - 1;
    $to = $#headers if $#headers < $to;
    warn "$from .. $to";

    # Open the files and print the headers in range.    
    my @fhs;
    for ($from .. $to) {
        open $fhs[ $_ - $from ], '>', "cols/col-$_" or die $!;
        print { $fhs[ $_ - $from ] } $headers[$_], "\n";
    }

    # Print the columns in range.
    while (<$input>) {
        chomp;
        my $i = 0;
        print { $fhs[$i++] } $_, "\n" for (split /,/)[ $from .. $to ];
    }
    close $_ for @fhs;  # close this batch before rewinding for the next one
    seek $input, $pos, 0;  # Go back to the first data line.
}
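
To run it (the file name split_batched.pl is just a placeholder):

$ perl split_batched.pl train.csv

The script creates cols/ itself and writes each header into its column file, so no separate mkdir or header-skipping step is needed.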
choroba