I have a huge CSV file, 1m lines. I was wondering if there is a way to split this file into smaller ones but keeping the first line (CSV header) on all the files.

It seems split is very fast but also very limited. You cannot add a suffix like .csv to the filenames.

split -l11000 products.csv file_

Is there an effective way to do this task in just bash? A one-line command would be great.

  • Hmm. I forgot that my gold bash badge gives me the ability to shut down a question. Didn't really mean for that to happen. I know other readers can still reply. The answer below looks pretty good. Put it in a shell script and then you have your one-liner. Good luck. – shellter Jul 19 '18 at 12:58
  • 2
  • Voting to re-open as the question isn't a duplicate. There is a specific difference: retaining the first line across all the split files. – nikhil Nov 29 '18 at 20:25
  • Does this answer your question? [How to split a file and keep the first line in each of the pieces?](https://stackoverflow.com/questions/1411713/how-to-split-a-file-and-keep-the-first-line-in-each-of-the-pieces) – Seanny123 Oct 26 '20 at 22:37
  • The best tool for this purpose is `xsv`: https://stackoverflow.com/a/68585985/8079808 – San Jul 30 '21 at 06:00

2 Answers

The answer to this question is yes, this is possible with awk.

The idea is to store the header line and print all remaining lines to files named like file.00001.csv:

awk -v l=11000 '
    (NR==1){header=$0; next}            # remember the first line as the header
    (NR%l==2){                          # first data line of a new chunk
        close(file)                     # close the previous output file
        file=sprintf("%s.%0.5d.csv", FILENAME, ++c)
        sub(/csv[.]/, "", file)         # file.csv.00001.csv -> file.00001.csv
        print header > file             # start the new file with the header
    }
    {print > file}' file.csv            # write the current record to the chunk

This works in the following way:

  • (NR==1){header=$0;next}: If the record/line is the first line, save that line as the header.
  • (NR%l==2){...}: Every time we have written l=11000 records/lines, we need to start writing to a new file. This happens whenever the record/line number modulo l equals 2, i.e. on lines 2, 2+l, 2+2l, 2+3l, .... When such a line is found, we do:
    • close(file): close the file we just wrote to.
    • file=sprintf("%s.%0.5d.csv",FILENAME,++c); sub(/csv[.]/,"",file): build the new filename as FILENAME.00XXX.csv; the sub() strips the original csv. part, so file.csv becomes file.00XXX.csv rather than file.csv.00XXX.csv.
    • print header > file: open the file and write the header to that file.
  • {print > file}: write the entries to the file.
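
As a quick sanity check (an illustrative snippet, assuming the input was file.csv and the chunk names produced above), you can verify that every chunk starts with the same header and count the lines per chunk:

# Should print exactly one line: the shared header
for f in file.0*.csv; do head -1 "$f"; done | sort -u

# Each chunk holds the header plus up to l=11000 data lines
wc -l file.0*.csv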

Note: if you don't care about the filename, you can use the following shorter version:

awk -v m=100 '
    (NR==1){h=$0;next}
    (NR%m==2) { close(f); f=sprintf("%s.%0.5d",FILENAME,++c); print h > f }
    {print > f}' file.csv
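
With this version the chunks inherit the full input name, so file.csv produces file.csv.00001, file.csv.00002, and so on. If you want a .csv extension on them after all, a rename loop afterwards would do it; a minimal sketch, assuming the input was file.csv:

# Rename file.csv.00001 -> file.00001.csv, and so on
for f in file.csv.0*; do
    mv "$f" "file.${f#file.csv.}.csv"
done
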
kvantour
  • Thank you @kvantour, this worked very well and was pretty fast: 5 seconds on a file with 1 million rows. Much better than my PHP code :) Cheers. – neisantos Jul 20 '18 at 19:46
  • I am trying to learn more about AWK. I understand variables are dynamic, but how does close(file) not throw an exception as you're trying to close undefined? – Jonny Waffles Jun 10 '20 at 16:38
  • 1
  • @JonnyWaffles `close()` silently does nothing if given an argument that does not represent a file, pipe, or coprocess that was opened with a redirection. In such a case, it returns a negative value, indicating an error. – kvantour Jun 10 '20 at 16:57
  • Thanks for your help. I didn't see the silent passing behavior within the docs I found https://www.gnu.org/software/gawk/manual/html_node/I_002fO-Functions.html. Do you have a particular resource you could recommend for understanding AWK? I am trying to write a similar script that pipes the files to the s3 cli over stdout so I don't write unnecessarily large files to disk, but I don't know how to buffer a logical partition file and call s3 cp - for each one. – Jonny Waffles Jun 10 '20 at 17:33
  • For anyone else who wants to write the output to s3 over stdout, it turns out you can simply [pipe](https://www.gnu.org/software/gawk/manual/html_node/Redirection.html#Redirection) to a command, so change `file` to something like `command = "aws s3 cp - s3://bucket/key" c++ ".csv"` – Jonny Waffles Jun 10 '20 at 21:11
  • 1
  • This is the _best_ awk answer I have ever seen - clearly written, clearly explained, just great. Wish I could give 2 upvotes. – tilde Apr 07 '21 at 13:02

Using GNU split to split file.csv:

export inputPrefix='file' parts=16 &&
split --verbose -d -n l/${parts} --additional-suffix=.csv \
    --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' \
    "${inputPrefix}.csv" "${inputPrefix}."
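
The --filter command runs once per chunk: for every chunk except the first (file.00.csv) it prepends the header read from the original file, then cat appends the chunk body. If you would rather cap each chunk at a fixed number of lines, as in the question, the same filter should also work with -l instead of -n; an untested sketch under that assumption:

export inputPrefix='file' &&
split --verbose -d -l 11000 --additional-suffix=.csv \
    --filter='([ "$FILE" != "${inputPrefix}.00.csv" ] && head -1 "${inputPrefix}.csv" ; cat) > "$FILE"' \
    "${inputPrefix}.csv" "${inputPrefix}."

Note that -l counts the header as part of the first chunk, so the first file ends up one data line shorter than the rest.
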
nzkeith