
How do I split a large CSV file (1 GB) into multiple files (say, one part with 1,000 rows, a second part with 10,000 rows, a third with 100,000, and so on) while preserving the header in each part?

How can I turn this:

h1 h2
a  aa
b  bb
c  cc
.
.
12483720 rows

into

h1 h2
a  aa
b  bb
.
.
.
1000 rows

And

h1 h2
x  xx
y  yy
.
.
.
10000 rows
  • Are you splitting (the whole input file should be represented in the output files, and the output files shouldn't contain duplicate data), or generating samples of varying size? – Aaron Mar 10 '20 at 10:31
  • It shouldn't contain duplicate data: the first file contains the first 1,000 lines, the second file contains the next 10,000 rows, and so on. Is that clear? – SST Mar 10 '20 at 10:34
  • each successive file contains **10x** more lines than the previous file, correct? – markp-fuso Mar 10 '20 at 13:42
  • Related: [Split CSV files into smaller files but keeping the headers?](https://stackoverflow.com/questions/51420966) – kvantour Mar 10 '20 at 14:11

5 Answers


Here is a small adaptation of the solution from [Split CSV files into smaller files but keeping the headers?](https://stackoverflow.com/questions/51420966):

awk -v l=1000 '(NR==1){header=$0;next}
               (n==l || n==0){
                  c=sprintf("%0.5d",c+1)
                  close(file); file=FILENAME; sub(/csv$/,c".csv",file)
                  print header > file
                  if (n==l) { n=0; l*=10 }
               }
               {print $0 > file; n++}' file.csv

This works in the following way:

  • (NR==1){header=$0;next}: if this is the first line, save it as the header and move on to the next record.
  • (n==l || n==0){...}: before the very first data line (n==0), and every time we have written the requested number of lines (n==l), we start a new output file and perform the following actions:
    • c=sprintf("%0.5d",c+1): increment the file counter and format it as five digits (00001, 00002, ...)
    • close(file): close the file we just wrote to, so awk does not run out of open file descriptors.
    • file=FILENAME; sub(/csv$/,c".csv",file): build the new filename from the input name (file.csv becomes file.00001.csv).
    • print header > file: open the new file and write the header to it.
    • if (n==l) { n=0; l*=10 }: reset the current record count and multiply the maximum record count by ten for the next file.
  • {print $0 > file; n++}: write the record to the current file and increment the record count.
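
As a quick sanity check (assuming the input is named file.csv as above, so the parts come out as file.00001.csv, file.00002.csv, ...), you can verify that every part starts with the header and holds the expected number of rows:

head -1 file.00001.csv file.00002.csv   # every part should begin with the header
wc -l file.0*.csv                       # expect 1001, 10001, 100001, ... (header + data rows)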
kvantour
  • I am getting the following error: awk: cmd. line:8: (FILENAME=file.csv FNR=2) fatal: expression for `>' redirection has null string value – SST Mar 11 '20 at 05:31

Another awk. First some test records:

$ seq 1 1234567 > file

Then the awk:

$ awk 'NR==1{n=1000;h=$0}{print > n}NR==n+c{n*=10;c=NR-1;print h>n}' file

Explained:

$ awk '
NR==1 {           # first record:
    n=1000        # set first output file size and
    h=$0          # store the header
}
{
    print > n     # output to file
}
NR==n+c {         # once target NR has been reached. close(n) goes here if needed
    n*=10         # grow target magnitude
    c=NR-1        # set the correction factor so the next file also gets exactly n lines
    print h > n   # first the head
}' file

Count the records:

$ wc -l 1000*
   1000 1000
  10000 10000
 100000 100000
1000000 1000000
 123571 10000000
1234571 total
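
The output files are simply named after their target size (1000, 10000, ...). If you prefer conventional CSV names, a small follow-up rename (the part_ prefix here is just an example) could be:

for f in 1000*; do mv "$f" "part_$f.csv"; done
ls part_*.csv    # part_1000.csv part_10000.csv ...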
James Brown

Hacky, but this utilizes the split utility, which does most of the heavy lifting of splitting the files. Then, since the split files follow a well-defined naming convention, I loop over the files without the header, concatenate the header with each file body into tmp.txt, and move that file back to the original filename.

# Use the `split` utility to split the CSV, 5000 lines per file,
# adding numeric suffixes and an additional '.split' suffix to make the
# chunks easy to identify.
split -l 5000 -d --additional-suffix=.split repro-driver-table.csv

# This identifies all files that should NOT have headers
# ls -1 *.split | egrep -v -e 'x0+\.split'

# This identifies files that do have headers
# ls -1 *.split | egrep -e 'x0+\.split'

# Walk the files that do not have a header.  For each one, cat the header
# (taken from the file that has it) together with the file body into tmp.txt,
# then mv tmp.txt back to the original filename.
for f in $(ls -1 *.split | egrep -v -e 'x0+\.split'); do
    cat <(head -1 $(ls -1 *.split | egrep -e 'x0+\.split')) $f > tmp.txt
    mv tmp.txt $f
done
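
To double-check that every chunk now begins with the header, you can peek at the first line of each one (assuming they all still match *.split):

head -1 *.split    # prints a '==> name <==' banner plus the first line of each chunk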
mpettis

Here's a first approach:

#!/bin/bash
head -1 "$1" > header           # save the header line
tail -n +2 "$1" | split - y     # split only the body, so the header is not duplicated in the first chunk
for f in y*; do
    cp header "h$f"             # start each output file with the header
    cat "$f" >> "h$f"           # append the chunk's rows
done
rm -f header
rm -f y*
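
A hypothetical invocation, assuming the script is saved as split_with_header.sh (the data.csv name is also just an example); split defaults to 1000-line chunks, so the results come out as hyaa, hyab, ...:

chmod +x split_with_header.sh
./split_with_header.sh data.csv
head -1 hy*    # each output file should start with the header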
Hellmar Becker

The following bash solution should work nicely:

IFS='' read -r header
for ((curr_file_max_rows=1000; 1; curr_file_max_rows*=10)) {
    curr_file_name="file_with_${curr_file_max_rows}_rows"
    echo "$header" > "$curr_file_name"
    for ((curr_file_row_count=0; curr_file_row_count < curr_file_max_rows; curr_file_row_count++)) {
        IFS='' read -r row || break 2
        echo "$row" >> "$curr_file_name"
    }
}

The first loop level produces the number of rows we're going to write to each successive file. It generates the file names and writes the header to them. It is an infinite loop because we don't check how many lines the input has, and therefore don't know beforehand how many files we're going to write, so we'll have to break out of this loop to end it.

Inside this loop we iterate a second time, this time over the number of lines we're going to write to the current file. In this inner loop we try to read a line from the input. If it succeeds, we write the line to the current output file; if it doesn't (we've reached the end of the input), we break out of both loop levels.

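Note that the script reads the CSV from standard input, so a hypothetical invocation (assuming it is saved as split_growing.sh) would redirect the file into it:

bash split_growing.sh < file.csv
ls file_with_*_rows    # file_with_1000_rows, file_with_10000_rows, ...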

Aaron