115

I have a CSV file (around 10,000 rows, each with 300 columns) stored on a Linux server. I want to break this CSV file into 500 CSV files of 20 records each, with each file having the same CSV header as the original.

Is there a Linux command to help with this conversion?

Pawan Mude
  • This does indeed work; however, the first file's header will be duplicated, giving the first CSV file two headers. – rickman May 11 '17 at 10:43
  • Related: [Split CSV files into smaller files but keeping the headers?](https://stackoverflow.com/questions/51420966) – kvantour Jul 24 '20 at 14:10
  • Does this answer your question? [How to split a file and keep the first line in each of the pieces?](https://stackoverflow.com/questions/1411713/how-to-split-a-file-and-keep-the-first-line-in-each-of-the-pieces) – Seanny123 Oct 26 '20 at 22:15

6 Answers

201

Use the Linux split command:

split -l 20 file.txt new

This splits the file "file.txt" into files whose names begin with "new", each containing 20 lines of text.

Type `man split` at the Unix prompt for more information. Note, however, that you will first have to remove the header from file.txt (using the `tail` command, for example) and then add it back onto each of the split files.
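The header round-trip described above can be sketched like this; the two-column `data.csv` built at the top is a hypothetical stand-in so the snippet is self-contained:

```shell
# Build a small sample CSV: 1 header line + 40 data rows (illustrative only).
printf 'col1,col2\n' > data.csv
for i in $(seq 1 40); do printf '%d,%d\n' "$i" "$((i * 2))" >> data.csv; done

# 1. Save the header line.
header=$(head -n 1 data.csv)

# 2. Split everything after the header into 20-line chunks: part_aa, part_ab, ...
tail -n +2 data.csv | split -l 20 - part_

# 3. Prepend the header back onto each chunk.
for f in part_*; do
    { printf '%s\n' "$header"; cat "$f"; } > "$f.csv" && rm "$f"
done
```

Each resulting `part_*.csv` then has the original header plus 20 data rows.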

James King
  • How can I skip the file.txt first (header) line? – forkfork Apr 05 '17 at 15:30
  • Use `wc -l` to get the number of lines, subtract 1 from that value (suppose `wc -l` gave 50), and then run `tail -n 49` (in this example) to get everything except the header line. Note that `wc -l` counts newline characters, so if the last line does not end in a newline your line count will be off by 1. – Lucas Roberts Feb 25 '18 at 23:45
  • @lucas, `tail -n +2` will print all lines but the first. – James King Feb 26 '18 at 02:57
  • @JamesKing, +1 & Thanks! I should read the docs more closely :) – Lucas Roberts Feb 26 '18 at 19:08
  • Point to note: this won't work correctly if you have a column value with newlines in it (such as you get from exporting from Excel or database tables with large blocks of text in them). – Femi Apr 29 '19 at 08:24
  • To keep the `.csv` extension on the new files, just add `--additional-suffix=.csv`. – Cocuba Aug 28 '19 at 15:14
  • How do I add the header back to each of the files? – Adam Jul 01 '21 at 11:54
  • You can also do `split --number=l/500` if you want 500 files of roughly similar size without splitting lines/records, or `split --line-bytes=100KB` if you want all files under 100 kB, again without lines split across files (cf. https://www.gnu.org/software/coreutils/manual/html_node/split-invocation.html). – bluu Dec 21 '21 at 11:49
  • How do you retain the header? – Ashok kumar Ganesan Jul 15 '22 at 21:41
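A quick sketch of the `--number=l/N` form from the comment above (GNU coreutils only; the file names are made up, and 5 stands in for 500):

```shell
# 103 sample lines, which do not divide evenly into 5 chunks.
seq 1 103 > rows.txt

# Split into exactly 5 files of roughly equal size, never breaking a line
# across two files: chunk_aa .. chunk_ae.
split --number=l/5 rows.txt chunk_

# Reassembling the chunks reproduces the original file byte for byte.
cat chunk_* | cmp - rows.txt && echo "lines preserved"
```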
103

Made it into a function. You can now call splitCsv <Filename> [chunkSize]

splitCsv() {
    HEADER=$(head -n 1 "$1")            # save the header row
    if [ -n "$2" ]; then                # optional second argument: chunk size
        CHUNK=$2
    else
        CHUNK=1000                      # default: 1000 rows per file
    fi
    tail -n +2 "$1" | split -l "$CHUNK" - "${1}_split_"   # split the data rows only
    for i in "${1}_split_"*; do
        sed -i -e "1i$HEADER" "$i"      # prepend the header to each chunk (GNU sed)
    done
}

Found on: http://edmondscommerce.github.io/linux/linux-split-file-eg-csv-and-keep-header-row.html

Philipp Moers
Martin Dinov
  • Can you explain the way it's working? I checked the original post, but there is no explanation there either, and not even an option to post comments. – shashi009 Jun 10 '16 at 13:30
  • What if a cell in the CSV contains a newline? – Ondřej Hlaváček Nov 11 '16 at 07:49
  • What kind of newline character is it? It still works for me, on UTF-8, not sure if this helps. – owyongsk Jan 21 '17 at 02:25
  • @shashi009: Assume the original file is called file.txt. 1: skip the first line, then pipe the rest of the file into `split`, which splits it into new files of 20 lines each, with the prefix `split_`. 2: iterate through the new split_* files, storing each name in the variable `file`, one at a time. 3: for each... 4: write the first line (column headers) from our original file to a `tmp_file`. 5: append the 20-line split file to `tmp_file`. 6: overwrite the old split_* file with the new `tmp_file`, so it keeps the column headers. – David Jan 24 '17 at 10:47
  • Prepending the header via `echo -e "$HEADER\n$(cat $i)" > $i` is unnecessarily slow. I replaced it with `sed -i -e "1i$HEADER" "$i"`. – Philipp Moers Mar 27 '20 at 15:46
  • splitCsv 15kcontacts.csv 5000 sed: 1: "1i254726770792": command i expects \ followed by text – Nixon Kosgei Aug 01 '21 at 19:25
  • It seems to not encode quotes correctly. I also get `sed: 1: "1i"id","Period start"," ...": command i expects \ followed by text` – mjaggard Sep 13 '22 at 15:13
34

A one-liner CSV splitter that preserves the header row in each split file. This example gives you 999 lines of data plus one header row per file.

cat bigFile.csv | parallel --header : --pipe -N999 'cat >split_file_{#}.csv'

For tips on installing `parallel`, see https://stackoverflow.com/a/53062251/401226, where the answer's comments cover installing the correct version of parallel for macOS and Debian/Ubuntu (on Ubuntu, use the specifically-named parallel package, which is more recent than the one bundled in moreutils).

Tim Richardson
31

This should work!

file_name = name of the file you want to split.
10000 = number of rows each split file will contain.
file_part_ = prefix of the split file names (file_part_00, file_part_01, file_part_02, and so on).

split -d -l 10000 file_name.csv file_part_

Soumyaansh
  • This works absolutely fine! Is there a way I can limit the number of 10,000-row files I create? Say I just want the first 200,000 rows divided into 10k-row CSV files and the rest ignored. – Pronomita Dey Jun 18 '18 at 12:03
  • @Pronomita `head -200000 file.csv | split -l 10000 - new_` – boloyao Sep 30 '18 at 00:55
  • I had a 13 GB CSV file with a corrupt line at about line ~69,000,000. It was blocking the import into BigQuery. This allowed me to split it recursively until I could isolate the line and fix it. No other solution was as efficient at handling large files without bringing the server down or locking things up. It took about 2 minutes to split the file into 5,000,000-row chunks. Thanks! – LP Papillon Sep 23 '19 at 13:05
  • Note that this has the same issue as [the most upvoted answer](https://stackoverflow.com/a/20721171/2859408), which also uses `split`: it will fail and mess up the results if your CSV contains cells with embedded newline characters. `split` will just split them in half, distributing them into two chunks, no matter what. – Gunter Ohrner Dec 27 '21 at 13:39
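The `head`-plus-`split` trick from @boloyao's comment can be sketched like this, scaled down (the `new_` prefix and sample sizes are illustrative):

```shell
# Stand-in data: 1000 rows, no header.
seq 1 1000 > big.csv

# Keep only the first 200 rows, split them into 50-line files
# (new_aa .. new_ad), and ignore the remaining 800 rows.
head -n 200 big.csv | split -l 50 - new_
```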
14

This should do it for you - all your files will end up called Part1-Part500.

#!/bin/bash
FILENAME=10000.csv
HDR=$(head -1 "$FILENAME")                 # Pick up CSV header line to apply to each file
tail -n +2 "$FILENAME" | split -l 20 - xyz # Split the data rows (header excluded) into chunks of 20 lines each
n=1
for f in xyz*              # Go through all newly created chunks
do
   echo "$HDR" > Part${n}  # Write out header to new file called "Part(n)"
   cat "$f" >> Part${n}    # Add in the 20 lines from the "split" command
   rm "$f"                 # Remove temporary file
   ((n++))                 # Increment name of output part
done

Mark Setchell
  • This created the files, but tabs (I mean \t) were replaced with spaces, and my CSV files use \t as the separator. – AmineG Dec 11 '15 at 13:34
  • This can be improved slightly by padding the numbers so that the files sort consecutively automatically, by replacing `${n}` with `$(printf "%05d" $n)`. – Phineas Dashevsky Jan 06 '16 at 21:46
  • Header line is duplicated in the first file. – Juha Palomäki Jun 19 '17 at 08:10
  • @JuhaPalomäki - Fixed that issue in an edit. – anoopjohn Dec 24 '21 at 01:02
  • Note that this has the same issue as [the most upvoted answer](https://stackoverflow.com/a/20721171/2859408), which also uses `split`: It will fail and mess up the results if your CSV contains cells with embedded newline characters. `split` will just split them in half, distributing them into two chunks, no matter what. – Gunter Ohrner Dec 27 '21 at 13:41
1

This question was asked many years ago, but for future readers I'd like to mention that the most convenient tool for this purpose is `xsv`, from https://github.com/BurntSushi/xsv

The `split` sub-command does exactly what the original question asks for. The documentation says:

split - Split one CSV file into many CSV files of N chunks

Each of the split chunks retains the header row.

San