How to split a file and keep the first line in each of the pieces?

Question

Given: One big text-data file (e.g. CSV format) with a 'special' first line (e.g., field names).

Wanted: An equivalent of the coreutils split -l command, but with the additional requirement that the header line from the original file appear at the beginning of each of the resulting pieces.

I am guessing some concoction of split and head will do the trick?

It seems reasonable that someone should add that as a built-in feature of `split`, doesn't it? — Dennis Williamson, Sep 11 '09 at 16:49
Probably the biggest factor *against* this becoming a built-in is that you generally reconstruct a split file by doing `cat a b c > reconstructed`. Extraneous lines in the file means the normal reconstruction approach does not reproduce the original file. — Mark Rushakoff, Sep 11 '09 at 18:23
That's what the upcoming (*not*) "`unsplit --remove-header`" utility is for! But seriously, `split`, if it were to have a "repeat-header" option, should still default to its current behavior. You'd only use header stuff if you really wanted it. — Dennis Williamson, Sep 11 '09 at 19:00
Yes, I think `--keep-first N` would make a nice option for `split` which would be useful in both line and byte mode — Arkady, Sep 11 '09 at 19:04
*I* think it *is* a good idea -- absolutely very useful for splitting a file for *distribution* rather than reconstruction. It's one of those "so simple, how is it not there yet" features of a Unix utility so old, that I'm skeptical that the "people in charge" haven't turned down previous proposals to do this exact functionality for some reason or another. — Mark Rushakoff, Sep 11 '09 at 19:14
I think the reasoning might be simply due to the POSIX spec for split not having that option. I can only imagine how difficult it is to add functionality to POSIX standards! http://www.opengroup.org/onlinepubs/009695399/utilities/split.html — Mark Rushakoff, Sep 11 '09 at 19:28
I updated my answer with a cool feature that GNU `split` provides. — Dennis Williamson, Nov 20 '14 at 17:42
I found such a proposal at https://lists.gnu.org/archive/html/bug-coreutils/2003-08/msg00022.html which wasn't so much flat-out turned down as discouraged ( because you ought to be able to write a script / program for that?? ) — justinpitts, Mar 09 '20 at 12:59
Related: [Split CSV files into smaller files but keeping the headers?](https://stackoverflow.com/questions/51420966) — kvantour, Jul 24 '20 at 14:11
The best tool for this purpose is `xsv`: https://stackoverflow.com/a/68585985/8079808 — San, Jul 30 '21 at 06:02

Dennis Williamson · Accepted Answer · 2018-11-16T19:43:33.277

73

This is robhruska's script cleaned up a bit:

tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat "$file" >> tmp_file
    mv -f tmp_file "$file"
done

I removed wc, cut, ls and echo in the places where they're unnecessary. I changed some of the filenames to make them a little more meaningful. I broke it out onto multiple lines only to make it easier to read.

If you want to get fancy, you could use mktemp or tempfile to create a temporary filename instead of using a hard coded one.

Edit

Using GNU split it's possible to do this:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }; export -f split_filter; tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

Broken out for readability:

split_filter () { { head -n 1 file.txt; cat; } > "$FILE"; }
export -f split_filter
tail -n +2 file.txt | split --lines=4 --filter=split_filter - split_

When --filter is specified, split runs the command (a function in this case, which must be exported) for each output file and sets the variable FILE, in the command's environment, to the filename.

A filter script or function could do any manipulation it wanted to the output contents or even the filename. An example of the latter might be to output to a fixed filename in a variable directory: > "$FILE/data.dat" for example.

edited Nov 16 '18 at 19:43

answered Sep 11 '09 at 16:47

Dennis Williamson

346,391
90
374
439

This will certainly work. I was just hoping for some slick one-liner like `for $part in (split -l 1000 myfile); cat <(head -n1 myfile) $part > myfile.$part; done` – Arkady Sep 11 '09 at 19:09
That can't work because `split`, of necessity, doesn't output on `stdout`. – Dennis Williamson Sep 11 '09 at 19:19
`split` *could* output the *names* of the files to stdout, though (as long as we are discussing what `split` *ought* to do :-) – Arkady Sep 11 '09 at 19:23
You're right. That could be *handy*. Sorry I misread your one-liner. – Dennis Williamson Sep 11 '09 at 20:04
Mac OS X 10.10.4 worked with the original snippet, but not the one-liner GNU split version. – Johnathan Elmore May 13 '15 at 22:59
1

@JohnathanElmore: Note that GNU utilities are available for OS X. Using [Homebrew](http://brew.sh/), for example. – Dennis Williamson May 14 '15 at 01:35
http://stackoverflow.com/a/30005262/1014710 has instructions for Homebrew GNU coreutils instructions – Johnathan Elmore May 14 '15 at 02:12
You may want to add `--additional-suffix=.txt` to the `split` command to keep the file extension – Henrik Høyer Jun 30 '20 at 15:35

Tim Richardson · Answer 2 · 2022-08-28T22:37:14.130

37

This one-liner will split the big csv into pieces of 999 records, preserving the header row at the top of each one (so 999 records + 1 header = 1000 rows)

cat bigFile.csv | parallel --header : --pipe -N999 'cat >file_{#}.csv'

Based on Ole Tange's answer.

See comments for some tips on installing parallel

edited Aug 28 '22 at 22:37

answered Oct 30 '18 at 10:28

Tim Richardson

6,608
6
44
71

please noted that if we consider the header row in each file then each smaller file will have 1000 rows in this solution. – Peiti Li Jun 17 '19 at 17:31
Which is why I use 999 :) – Tim Richardson Jun 17 '19 at 22:21
6

I had to `brew install parallel` on macOS. Works like a charm! – Asimov4 Mar 19 '20 at 19:16
This was perfect. Thank you so much! – Ram RS Jan 08 '21 at 21:53
3

Like MacOS, Ubuntu 20.04 also needs to have `parallel` installed for this to work. Note that Ubuntu suggests either `sudo apt install moreutils` _# version 0.63-1_, or `sudo apt install parallel` _# version 20161222-1.1_ -- go with the latter suggestion. The first suggestion, `moreutils` sounds extra useful, but the version of parallel included in that package errored out (`parallel: invalid option -- '-'`). The second suggestion worked as expected ([details](https://stackoverflow.com/a/19503387/697507)). – Tracy Logan Jan 18 '21 at 18:22

score 16 · Answer 3 · edited Mar 31 '20 at 19:12

16

You could use the new --filter functionality in GNU coreutils split >= 8.13 (2011):

tail -n +2 FILE.in | split -l 50 - --filter='sh -c "{ head -n1 FILE.in; cat; } > $FILE"'

edited Mar 31 '20 at 19:12

Asclepius

57,944
17
167
143

answered Aug 08 '14 at 00:09

pixelbeat

30,615
9
51
60

2

I like the one-liner version. Just to make it more generic for bash, I did: `tail -n +2 FILE.in | split -d --lines 50 - --filter='bash -c "{ head -n1 ${FILE%.*}; cat; } > $FILE"' FILE.in.x` – KullDox May 04 '17 at 21:30

score 14 · Answer 4 · answered Sep 12 '09 at 15:25

You can use [mg]awk:

awk 'NR==1{
        header=$0; 
        count=1; 
        print header > "x_" count; 
        next 
     } 

     !( (NR-1) % 100){
        count++; 
        print header > "x_" count;
     } 
     {
        print $0 > "x_" count
     }' file

100 is the number of lines of each slice. It doesn't require temp files and can be put on a single line.

Rob Hruska · Answer 5 · 2009-09-11T16:22:53.327

I'm a novice when it comes to Bash-fu, but I was able to concoct this two-command monstrosity. I'm sure there are more elegant solutions.

$> tail -n +2 file.txt | split -l 4
$> for file in `ls xa*`; do echo "`head -1 file.txt`" > tmp; cat $file >> tmp; mv -f tmp $file; done

This is assuming your input file is file.txt, you're not using the prefix argument to split, and you're working in a directory that doesn't have any other files that start with split's default xa* output format. Also, replace the '4' with your desired split line size.

Ole Tange · Answer 6 · 2018-02-22T13:42:15.747

Use GNU Parallel:

parallel -a bigfile.csv --header : --pipepart 'cat > {#}'

If you need to run a command on each of the parts, then GNU Parallel can help do that, too:

parallel -a bigfile.csv --header : --pipepart my_program_reading_from_stdin
parallel -a bigfile.csv --header : --pipepart --fifo my_program_reading_from_fifo {}
parallel -a bigfile.csv --header : --pipepart --cat my_program_reading_from_a_file {}

If you want to split into 2 parts per CPU core (e.g. 24 cores = 48 equal sized parts):

parallel --block -2 -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

If you want to split into 10 MB blocks:

parallel --block 10M -a bigfile.csv --header : --pipepart my_program_reading_from_stdin

Thyag · Answer 7 · 2020-01-21T18:01:40.920

Below is a 4 liner that can be used to split a bigfile.csv into multiple smaller files, and preserve the csv header. Uses only built-in Bash commands (head, split, find, grep, xargs, and sed) which should work on most *nix systems. Should also work on Windows if you install mingw-64 / git-bash.


csvheader=`head -1 bigfile.csv`
split -d -l10000 bigfile.csv smallfile_
find .|grep smallfile_ | xargs sed -i "1s/^/$csvheader\n/"
sed -i '1d' smallfile_00

Line by line explanation:

Capture the header to a variable named csvheader
Split the bigfile.csv into a number of smaller files with prefix smallfile_
Find all smallfiles and insert the csvheader into the FIRST line using xargs and sed -i. Note that you need to use sed within "double quotes" in order to use variables.
The first file named smallfile_00 will now have redundant headers on lines 1 and 2 (from the original data as well as from the sed header insert in step 3). We can remove the redundant header with sed -i '1d' command.

Getting below error: `Error While executing report: Error: Command failed with exit code 123: find .|grep file_part_ | xargs sed -i "1s/^/column 1, column 2 /"sed: -e expression #1, char 78: unterminated `s' command` — Subburaj, Apr 19 '23 at 11:20
@Subburaj Try removing any single/double quotes from the bigfile's header For example if you have `"column 1", "column 2"` - then ideally it should look like : `column 1, column 2` Another option is to change the script's first line like this:
csvheader="column 1, column 2" — Thyag, May 02 '23 at 00:09

score 2 · Answer 8 · answered Sep 11 '09 at 20:04

This is a more robust version of Denis Williamson's script. The script creates a lot of temporary files, and it would be a shame if they were left lying around if the run was incomplete. So, let's add signal trapping (see http://tldp.org/LDP/Bash-Beginners-Guide/html/sect_12_02.html and then http://tldp.org/LDP/abs/html/debugging.html) and remove our temporary files; this is a best practice anyways.

trap 'rm split_* tmp_file ; exit 13' SIGINT SIGTERM SIGQUIT 
tail -n +2 file.txt | split -l 4 - split_
for file in split_*
do
    head -n 1 file.txt > tmp_file
    cat $file >> tmp_file
    mv -f tmp_file $file
done

Replace '13' with whatever return code you want. Oh, and you should probably be using mktemp anyways (as some have already suggested), so go ahead and remove 'tmp_file" from the rm in the trap line. See the signal man page for more signals to catch.

score 2 · Answer 9 · answered Jan 21 '15 at 17:43

2

I liked the awk version of marco, adopted from this a simplified one-liner where you can easily specify the split fraction as granular as you want:

awk 'NR==1{print $0 > FILENAME ".split1";  print $0 > FILENAME ".split2";} NR>1{if (NR % 10 > 5) print $0 >> FILENAME ".split1"; else print $0 >> FILENAME ".split2"}' file

answered Jan 21 '15 at 17:43

dreamflasher

1,387
15
22

I like this solution, however it's limited to only two split files – Bas May 09 '16 at 09:47
If you like it there is the upvote feature for it ;) It can easily be adjusted to more files, but yes it's not as flexible as split -l – dreamflasher May 09 '16 at 10:06
"one liner" ...pshh – Pandem1c Jan 19 '17 at 01:12

score 2 · Answer 10 · answered Jan 29 '15 at 21:42

I really liked Rob and Dennis' versions, so much so that I wanted to improve them.

Here's my version:

in_file=$1
awk '{if (NR!=1) {print}}' $in_file | split -d -a 5 -l 100000 - $in_file"_" # Get all lines except the first, split into 100,000 line chunks
for file in $in_file"_"*
do
    tmp_file=$(mktemp $in_file.XXXXXX) # Create a safer temp file
    head -n 1 $in_file | cat - $file > $tmp_file # Get header from main file, cat that header with split file contents to temp file
    mv -f $tmp_file $file # Overwrite non-header containing file with header-containing file
done

Differences:

in_file is the file argument you want to split maintaining headers
Use awk instead of tail due to awk having better performance
split into 100,000 line files instead of 4
Split file name will be input file name appended with an underscore and numbers (up to 99999 - from the "-d -a 5" split argument)
Use mktemp to safely handle temporary files
Use single head | cat line instead of two lines

Suggestion: change awk script to simply: 'NR > 1' as print is the default action. — runrig, Apr 15 '21 at 19:29
That said, I doubt awk is any faster (or at least significantly faster) than tail in this case. — runrig, Apr 15 '21 at 19:36
I also might put the header in a variable before the loop, and then 'echo "$header | ...." in the loop — runrig, Apr 15 '21 at 19:38

score 1 · Answer 11 · answered Jun 01 '19 at 23:29

Inspired by @Arkady's comment on a one-liner.

MYFILE variable simply to reduce boilerplate
split doesn't show file name, but the --additional-suffix option allows us to easily control what to expect
removal of intermediate files via rm $part (assumes no files with same suffix)

MYFILE=mycsv.csv && for part in $(split -n4 --additional-suffix=foo $MYFILE; ls *foo); do cat <(head -n1 $MYFILE) $part > $MYFILE.$part; rm $part; done

Evidence:

-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xaafoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xabfoo
-rw-rw-r--  1 ec2-user ec2-user  32040108 Jun  1 23:18 mycsv.csv.xacfoo
-rw-rw-r--  1 ec2-user ec2-user  32040110 Jun  1 23:18 mycsv.csv.xadfoo

and of course head -2 *foo to see the header is added.

score 0 · Answer 12 · answered May 26 '20 at 22:01

A simple but maybe not as elegant way: Cut off the header beforehand, split the file, and then rejoin the header on each file with cat, or with whatever file is reading it in. So something like:

head -n1 file.txt > header.txt
split -l file.txt
cat header.txt f1.txt

score 0 · Answer 13 · answered Jan 19 '23 at 16:04

I had a better result using the following code, every split file will have a header and the generated files will have a normalized name.

export F=input.csv && LINES=3 &&\
export PF="${F%.*}_" &&\
split -l $LINES "${F}" "${PF}" &&\
for fn in $PF*
do
  mv "${fn}" "${fn}.csv"
done &&\
export FILES=($PF*) && for file in "${FILES[@]:1}"
do
  head -n 1 "${F}" > tmp_file
  cat "$file" >> tmp_file
  mv -f tmp_file "${file}"
done

output

$ wc -l input*
  22 input.csv
   3 input_aa.csv
   4 input_ab.csv
   4 input_ac.csv
   4 input_ad.csv
   4 input_ae.csv
   4 input_af.csv
   4 input_ag.csv
   2 input_ah.csv
  51 total

How to split a file and keep the first line in each of the pieces?

13 Answers13

Linked

Related