100

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

I am guessing some concoction of split and head will do the trick?

Arkady
  • It seems reasonable that someone should add that as a built-in feature of `split`, doesn't it? – Dennis Williamson Sep 11 '09 at 16:49
  • Probably the biggest factor *against* this becoming a built-in is that you generally reconstruct a split file by doing `cat a b c > reconstructed`. Extraneous lines in the file mean the normal reconstruction approach does not reproduce the original file. – Mark Rushakoff Sep 11 '09 at 18:23
  • That's what the upcoming (*not*) "`unsplit --remove-header`" utility is for! But seriously, `split`, if it were to have a "repeat-header" option, should still default to its current behavior. You'd only use header stuff if you really wanted it. – Dennis Williamson Sep 11 '09 at 19:00
  • Yes, I think `--keep-first N` would make a nice option for `split` which would be useful in both line and byte mode – Arkady Sep 11 '09 at 19:04
  • *I* think it *is* a good idea -- absolutely very useful for splitting a file for *distribution* rather than reconstruction. It's one of those "so simple, how is it not there yet" features of a Unix utility so old that I'm skeptical that the "people in charge" haven't turned down previous proposals for this exact functionality for some reason or another. – Mark Rushakoff Sep 11 '09 at 19:14
  • I think the reasoning might be simply due to the POSIX spec for split not having that option. I can only imagine how difficult it is to add functionality to POSIX standards! http://www.opengroup.org/onlinepubs/009695399/utilities/split.html – Mark Rushakoff Sep 11 '09 at 19:28
  • I updated my answer with a cool feature that GNU `split` provides. – Dennis Williamson Nov 20 '14 at 17:42
  • I found such a proposal at https://lists.gnu.org/archive/html/bug-coreutils/2003-08/msg00022.html which wasn't so much flat-out turned down as discouraged (because you ought to be able to write a script / program for that?) – justinpitts Mar 09 '20 at 12:59
  • Related: [Split CSV files into smaller files but keeping the headers?](https://stackoverflow.com/questions/51420966) – kvantour Jul 24 '20 at 14:11
  • The best tool for this purpose is `xsv`: https://stackoverflow.com/a/68585985/8079808 – San Jul 30 '21 at 06:02

13 Answers

73

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.
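
A minimal sketch of that variation, assuming the mktemp option (mktemp creates the temporary file and prints its path; everything else is unchanged):

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    tmp_file=$(mktemp)
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done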

Edit

Using GNU split it's possible to do this:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter would be to output to a fixed filename in a variable directory: > "$FILE/data.dat".
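
The filter could also compress each piece as it is written; a sketch combining that idea with the header function above (gzip and the .gz suffix are my own choices):

split_filter () { { head -n 1 file.txt; cat; } | gzip > "$FILE.gz"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_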

Dennis Williamson
37

This one-liner will split the big CSV into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows).

cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'

Based on Ole Tange's answer.

See the comments for some tips on installing parallel.
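
To sanity-check the result, you can count the rows in each piece; every full piece should come to 1000 (a quick check, assuming the file_{#}.csv names from the command above):

wc -l file_*.csv
head -n 1 file_*.csv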

Tim Richardson
  • Please note that if we count the header row in each file, each smaller file will have 1000 rows with this solution. – Peiti Li Jun 17 '19 at 17:31
  • Which is why I use 999 :) – Tim Richardson Jun 17 '19 at 22:21
  • I had to `brew install parallel` on macOS. Works like a charm! – Asimov4 Mar 19 '20 at 19:16
  • This was perfect. Thank you so much! – Ram RS Jan 08 '21 at 21:53
  • Like MacOS, Ubuntu 20.04 also needs to have `parallel` installed for this to work. Note that Ubuntu suggests either `sudo apt install moreutils` _# version 0.63-1_, or `sudo apt install parallel` _# version 20161222-1.1_ -- go with the latter suggestion. The first suggestion, `moreutils`, sounds extra useful, but the version of parallel included in that package errored out (`parallel: invalid option -- '-'`). The second suggestion worked as expected ([details](https://stackoverflow.com/a/19503387/697507)). – Tracy Logan Jan 18 '21 at 18:22
16

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):

tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'
Asclepius
pixelbeat
  • I like the one-liner version. Just to make it more generic for bash, I did: `tail -n +2 FILE.in | split -d --lines 50 - --filter='bash -c "{ head -n1 ${FILE%.*}; cat; } > $FILE"' FILE.in.x` – KullDox May 04 '17 at 21:30
14

You can use [mg]awk:

awk 'NR==1{
        header=$0; 
        count=1; 
        print header > "x_" count; 
        next 
     } 

     !( (NR-1) % 100){
        count++; 
        print header > "x_" count;
     } 
     {
        print $0 > "x_" count
     }' file

Here, 100 is the number of data lines in each slice (each output file additionally gets the header). It doesn't require temp files and can be put on a single line.
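
If you'd rather not edit the script to change the slice size, a sketch of the same approach with the size passed in via -v (the variable name n is my own):

awk -v n=100 '
     NR==1{
        header=$0
        count=1
        print header > "x_" count
        next
     }
     !((NR-1) % n){
        count++
        print header > "x_" count
     }
     {
        print $0 > "x_" count
     }' file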

marco
8

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This assumes your input file is file.txt, that you're not using the prefix argument to split, and that you're working in a directory that doesn't have any other files matching split's default xa* output name format. Also, replace the '4' with your desired split line size.

Rob Hruska
4

Use GNU Parallel:

parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:

parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal-sized parts):

parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:

parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
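
The same --pipepart machinery can also just write the parts to disk, each with the header; a sketch combining the first and last commands above (the part_{#}.csv naming is my own):

parallel --block 10M -a bigfile.csv --header : --pipepart 'cat > part_{#}.csv'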
Ole Tange
4

Below is a four-liner that can be used to split bigfile.csv into multiple smaller files while preserving the CSV header. It uses only standard utilities (head, split, find, grep, xargs, and sed) that are available on most *nix systems, and it should also work on Windows if you install mingw-w64 / git-bash.


csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

Line by line explanation:

  1. Capture the header to a variable named csvheader
  2. Split the bigfile.csv into a number of smaller files with prefix smallfile_
  3. Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that the sed expression must be inside "double quotes" so that the $csvheader variable is expanded.
  4. The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.
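
Note that sed treats characters such as / and & in $csvheader specially, which can produce errors like the unterminated `s' command mentioned in the comments below. A sketch of a variant that sidesteps sed's escaping entirely by prepending the header with cat (the header.csv temp file and the .csv renaming are my additions):

head -1 bigfile.csv > header.csv
tail -n +2 bigfile.csv | split -d -l10000 - smallfile_
for f in smallfile_*; do
  cat header.csv "$f" > "$f.csv" && rm "$f"
done
rm header.csv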
Thyag
  • Getting below error: `Error While executing report: Error: Command failed with exit code 123: find .|grep file_part_ | xargs sed -i "1s/^/column 1, column 2 /"sed: -e expression #1, char 78: unterminated `s' command` – Subburaj Apr 19 '23 at 11:20
  • @Subburaj Try removing any single/double quotes from the bigfile's header. For example, if you have `"column 1", "column 2"`, then ideally it should look like `column 1, column 2`. Another option is to change the script's first line to `csvheader="column 1, column 2"`. – Thyag May 02 '23 at 00:09
2

This is a more robust version of Dennis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was interrupted. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyway.

trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT 
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyway (as some have already suggested), so go ahead and remove 'tmp_file' from the rm in the trap line. See the signal man page for more signals to catch.
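
Putting both of those suggestions together, a sketch using mktemp with the trap (rm -f so the trap doesn't complain if nothing has been created yet):

tmp_file=$(mktemp)
trap 'rm -f split_* "$tmp_file"; exit 13' SIGINT SIGTERM SIGQUIT
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > "$tmp_file"
    cat "$file" >> "$tmp_file"
    mv -f "$tmp_file" "$file"
done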

Sam Bisbee
2

I liked marco's awk version and adapted from it this simplified one-liner, where you can easily specify the split fraction as granularly as you want:

awk 'NR==1{print $0 > FILENAME ".split1";  print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file
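
Broken out for readability (the same command, just reformatted):

awk 'NR==1 {
        print $0 > FILENAME ".split1"
        print $0 > FILENAME ".split2"
     }
     NR>1 {
        if (NR % 10 > 5)
            print $0 >> FILENAME ".split1"
        else
            print $0 >> FILENAME ".split2"
     }' file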
dreamflasher
2

I really liked Rob's and Dennis's versions, so much so that I wanted to improve them.

Here's my version:

in_file=$1
awk '{if (NR!=1) {print}}' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_" # Get all lines except the first, split into 100,000 line chunks
for file in "${in_file}_"*
do
    tmp_file=$(mktemp "$in_file.XXXXXX") # Create a safer temp file
    head -n 1 "$in_file" | cat - "$file" > "$tmp_file" # Get header from main file, cat that header with split file contents to temp file
    mv -f "$tmp_file" "$file" # Overwrite non-header-containing file with header-containing file
done

Differences:

  1. in_file is the file argument you want to split maintaining headers
  2. Use awk instead of tail due to awk having better performance
  3. split into 100,000 line files instead of 4
  4. Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
  5. Use mktemp to safely handle temporary files
  6. Use single head | cat line instead of two lines
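
A sketch of the loop with the simplifications runrig suggests in the comments below (header captured once in a variable, awk reduced to its default print action):

in_file=$1
header=$(head -n 1 "$in_file")
awk 'NR > 1' "$in_file" | split -d -a 5 -l 100000 - "${in_file}_"
for file in "${in_file}_"*
do
    tmp_file=$(mktemp "$in_file.XXXXXX")
    { echo "$header"; cat "$file"; } > "$tmp_file"
    mv -f "$tmp_file" "$file"
done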
Garren S
  • Suggestion: change awk script to simply: 'NR > 1' as print is the default action. – runrig Apr 15 '21 at 19:29
  • That said, I doubt awk is any faster (or at least significantly faster) than tail in this case. – runrig Apr 15 '21 at 19:36
  • I also might put the header in a variable before the loop, and then use `echo "$header" | ....` in the loop – runrig Apr 15 '21 at 19:38
1

Inspired by @Arkady's comment on a one-liner.

  • The MYFILE variable is simply to reduce boilerplate
  • split doesn't report the output file names, but the --additional-suffix option lets us easily control what to expect
  • Removal of intermediate files via rm $part (assumes no other files share the suffix)

MYFILE=mycsv.csv && for part in $(split -n l/4 --additional-suffix=foo "$MYFILE"; ls *foo); do cat <(head -n1 "$MYFILE") "$part" > "$MYFILE.$part"; rm "$part"; done

Evidence:

-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xaafoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xabfoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xacfoo
-rw-rw-r--  1 ec2-user ec2-user  32040110 Jun  1 23:18 mycsv.csv.xadfoo

And of course, head -2 *foo shows that the header was added.

0

A simple but maybe not so elegant way: cut off the header beforehand, split the file, and then rejoin the header to each piece with cat, or with whatever tool reads the file in. So, something like:

  1. head -n1 file.txt > header.txt
  2. tail -n +2 file.txt | split -l 4 -
  3. cat header.txt xaa
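
A concrete sketch of those steps, looping over every piece (the chunk size of 4 and the .csv output names are my own choices):

head -n1 file.txt > header.txt
tail -n +2 file.txt | split -l 4 -
for f in xa*; do cat header.txt "$f" > "$f.csv"; done
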
0

I had better results using the following code: every split file will have a header, and the generated files get normalized names.

export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
  mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
  head -n 1 "${F}" > tmp_file
  cat "$file" >> tmp_file
  mv -f tmp_file "${file}"
done

output

$ wc -l input*
  22 input.csv
   3 input_aa.csv
   4 input_ab.csv
   4 input_ac.csv
   4 input_ad.csv
   4 input_ae.csv
   4 input_af.csv
   4 input_ag.csv
   2 input_ah.csv
  51 total
deFreitas