9

How can I split a large CSV file (~100 GB) and preserve the header in each part?

For instance

h1 h2
a  aa
b  bb

into

h1 h2
a  aa

and

h1 h2
b  bb
echo

3 Answers

6

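First, extract the header. Don't load the data into a shell variable: at ~100GB it will not fit in memory, and an unquoted echo $data would collapse the newlines anyway.

header=$(head -n 1 "$file")

Then stream everything after the header straight into split:

tail -n +2 "$file" | split [options...] -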

The chunk size goes in the options (for example -l for a number of lines); the name prefix for the resulting files is an optional argument that follows the trailing -. That - must not be removed, as it tells split to read its data from stdin.
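
As a rough illustration of those options, here is a minimal sketch. The file name demo_split.csv, the prefix chunk_ and the chunk size of 2 lines are all made up for the demo; for a real 100GB file you would pick a much larger -l value. Splitting by lines with -l keeps rows intact, whereas -b splits by bytes and can cut a row in half.

```shell
#!/bin/sh
# Tiny hypothetical input so the sketch can be run as-is.
file=demo_split.csv
printf 'h1 h2\na aa\nb bb\nc cc\n' > "$file"

# -l 2    : two data lines per chunk (assumed value, change as needed)
# -       : read from stdin
# chunk_  : assumed prefix; split appends aa, ab, ... to it
tail -n +2 "$file" | split -l 2 - chunk_

# List the chunks that were produced.
ls chunk_*
```

With three data rows and two lines per chunk, this leaves chunk_aa (two rows) and chunk_ab (one row). Neither has a header yet.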

Then you can insert the header at the top of each file (this form of -i and 1i is GNU sed syntax; BSD/macOS sed needs a different invocation):

sed -i "1i$header" "$splitOutputFile"

You should obviously do that last part in a for loop, but its exact code will depend on the prefix chosen for the split operation.
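
For example, the loop might look like the sketch below. It assumes GNU sed, a made-up prefix part_, and a header that contains no backslashes or other characters sed would treat specially; the sample file is only there so the snippet runs on its own.

```shell
#!/bin/sh
# Tiny hypothetical input so the sketch can be run as-is.
file=demo_loop.csv
printf 'h1 h2\na aa\nb bb\n' > "$file"

header=$(head -n 1 "$file")

# "part_" is an assumed prefix; split names the chunks part_aa, part_ab, ...
tail -n +2 "$file" | split -l 1 - part_

# Insert the header as line 1 of every chunk (GNU sed syntax).
for splitOutputFile in part_*; do
  sed -i "1i$header" "$splitOutputFile"
done

# Show the first chunk, now starting with the header.
cat part_aa
```

The glob in the for loop is the only part tied to the prefix: change part_ in the split call and in the loop together.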

Aaron
1

I found that none of the previous solutions worked properly on the Mac systems my script was targeting (why, Apple? why?). I eventually ended up with a printf approach that worked pretty well as a proof of concept. I plan to improve performance by putting the temporary files on a ramdisk or similar, since as written it puts a lot on disk and will probably be slow.

#!/bin/sh

# Pass a file in as the first argument on the command line (note, not secure)
file=$1

# Get the header line out
header=$(head -n 1 "$file")

# Separate the data from the header
tail -n +2 "$file" > output.data

# Split the data into 1000 lines per file (change as you wish).
# The "output_part_" prefix keeps the glob below from matching output.data.
split -l 1000 output.data output_part_

# Prepend the header to each file from split, going through a temporary
# file so the part's trailing newline is preserved
for part in output_part_*
do
  { printf '%s\n' "$header"; cat "$part"; } > "$part.tmp" && mv "$part.tmp" "$part"
done
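
If the extra disk traffic turns out to be a problem, one possible alternative is a single awk pass that writes the header and the rows to each part directly, so no intermediate data file or second rewrite is needed. This is a different technique from the script above, shown only as a sketch: the sample file, the out_ prefix, the .csv suffix and the 2-rows-per-part setting are all assumptions for the demo.

```shell
#!/bin/sh
# Tiny hypothetical input so the sketch can be run as-is.
file=demo_awk.csv
printf 'h1 h2\na aa\nb bb\nc cc\n' > "$file"

# lines  : data rows per part (assumed value)
# prefix : assumed prefix for the output file names
awk -v lines=2 -v prefix=out_ '
  NR == 1 { header = $0; next }            # remember the header line
  (NR - 2) % lines == 0 {                  # first row of a new part
    if (out != "") close(out)              # avoid running out of open files
    out = sprintf("%s%04d.csv", prefix, ++n)
    print header > out                     # every part starts with the header
  }
  { print > out }                          # copy the data row to the current part
' "$file"

ls out_*
```

Each row of the input is read once and written once, which matters more at 100GB than it does in this three-row demo.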
Josiah
0

You may download the freeware CsvSplitter from here. It comes as a zip from the website that contains a simple portable .exe file and a .txt file that the executable needs in order to run; just extract the contents into some directory and you're ready to work:

[image] It can split the file, as can be seen in this picture: [image]

Everything is self-explanatory but more details can be found here

Shaina Raza