How can I split a large text file into smaller files with an equal number of lines?

Question

I've got a large (by number of lines) plain text file that I'd like to split into smaller files, also by number of lines. So if my file has around 2M lines, I'd like to split it up into 10 files that contain 200k lines, or 100 files that contain 20k lines (plus one file with the remainder; being evenly divisible doesn't matter).

I could do this fairly easily in Python, but I'm wondering if there's any kind of ninja way to do this using Bash and Unix utilities (as opposed to manually looping and counting / partitioning lines).

Out of curiosity, after they're "split", how does one "combine" them? Something like "cat part2 >> part1"? Or is there another ninja utility? Mind updating your question? – dlamotte, Jan 06 '10 at 22:47
Yes, cat is short for concatenate. In general, apropos is useful for finding appropriate commands, e.g. see the output of: apropos split – pixelbeat, Jan 06 '10 at 22:51
As an aside, OS X users should make sure their file contains Linux or Unix-style line breaks (LF) instead of classic Mac-style line breaks (CR). The split and csplit commands will not work if your line breaks are carriage returns instead of line feeds. TextWrangler from Bare Bones Software can help you with this if you're on Mac OS: you can choose which line break characters to use when you save (or Save As...) your text files. – , Oct 21 '12 at 21:34
binary version: http://unix.stackexchange.com/questions/1588/break-a-large-file-into-smaller-pieces — Ciro Santilli OurBigBook.com, Apr 26 '16 at 12:22

score 1095 · Accepted Answer · edited Aug 11 '21 at 23:08

Have a look at the split command:

$ split --help
Usage: split [OPTION] [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic to standard error just
                            before each output file is opened
      --help     display this help and exit
      --version  output version information and exit

You could do something like this:

split -l 200000 filename

which will create files named xaa xab xac ..., each with 200000 lines (the last file holds whatever lines remain).

Another option, split by size of output file (still splits on line breaks):

split -C 20m --numeric-suffixes input_filename output_prefix

creates files like output_prefix00 output_prefix01 output_prefix02 ... each of maximum size 20 megabytes.
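
To see the line-based split and its remainder file in action, here is a minimal sketch on a generated file. The temporary directory, the file name sample.txt and the prefix part_ are made up for this demo and are not part of the answer.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Build a 25-line sample file containing the numbers 1..25.
seq 1 25 > sample.txt

# Split into pieces of 10 lines each, using "part_" as the output prefix.
split -l 10 sample.txt part_

# Three files appear: part_aa and part_ab hold 10 lines, part_ac the last 5.
wc -l part_*
```

The last piece is simply shorter; nothing is dropped.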

answered Jan 06 '10 at 22:44 by Mark Byers · edited Aug 11 '21 at 23:08 by Peter Mortensen

You can also split a file by size: `split -b 200m filename` (m for megabytes, k for kilobytes, or no suffix for bytes). – Abhi Beckert Jun 24 '11 at 07:55
Split by size and ensure files are split on line breaks: `split -C 200m filename` – Clayton Stanley Dec 13 '12 at 02:12
split produces garbled output with Unicode (UTF-16) input. At least on Windows with the version I have. – Vertigo May 24 '13 at 07:57
Using `split data.csv` in OSX 10.8.4 to separate a 5k line file just produces an identical file named `xaa`. – geotheory Aug 16 '13 at 13:01
geotheory, you need to pass parameters into the command to tell it how to divide the file. Try `split -l 1000 data.csv` and it'll divide your 5000-line file into five 1000-line files called xaa, xab, xac, xad and xae. – Alistair McMillan Aug 22 '13 at 14:04
@geotheory, be sure to follow LeberMac's advice earlier in the thread about first converting CR (Mac) line endings to LF (Linux) line endings using TextWrangler or BBEdit. I had the exact same problem as you until I found that piece of advice. – sstringer Aug 25 '13 at 20:00
Thanks both. The default 1000 lines works without need for specification. But sstringer rightly identifies LeberMac's solution. – geotheory Aug 25 '13 at 20:26
And to join them back together? `join` I guess, but do I have to supply the file1 file2 etc. in a certain order? All I have now are files like `xaa, xab, xad...` – Ian Vaughan Oct 10 '13 at 12:01
Be careful when using the `-d` option (numeric suffixes) together with the default suffix length (which is 2), as split then stops at `x99` (at least on version 8.4). When you want numeric suffixes, always specify a suffix length with `-a`. – tttthomasssss Dec 12 '15 at 08:46
But how can I do this while _maintaining_ the header? Downstream, each file is processed by an R script that for various reasons requires the presence of the original column names... Thanks! May have found the answer: http://stackoverflow.com/questions/1411713/how-to-split-a-file-and-keep-the-first-line-in-each-of-the-pieces – Sander W. van der Laan Dec 02 '16 at 09:29
The `-d` option is not available on OSX; use `gsplit` instead. Hope this is useful for Mac users. – user5698801 Jul 23 '17 at 11:54
Thanks. The split command helped me to create files with a defined number of lines and/or size as well. – Sohel Pathan May 15 '18 at 05:50
If you have some copy-pasta to do after, this could be a useful followup: `gedit x*` – Stefanos Chrs Nov 12 '21 at 11:50
I think you should add that the last file will most likely have fewer than n lines. The current formulation might cause concerns as to what happens to the remainder. – Radio Controlled Nov 20 '21 at 07:36

score 108 · Answer 2 · edited Aug 11 '21 at 23:19

Use the split command:

split -l 200000 mybigfile.txt

answered Jan 06 '10 at 22:45 by Robert Christie · edited Aug 11 '21 at 23:19 by Peter Mortensen

Can we also set a maximum number of output files? For example, split that big file but don't exceed 50 outputs, even if there are lines remaining in the big file. – Dr.jacky Apr 25 '23 at 16:57

score 50 · Answer 3 · edited Jul 16 '18 at 11:16

Yes, there is a split command. It will split a file by lines or bytes.

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is `x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   use suffixes of length N (default 2)
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes  use numeric suffixes instead of alphabetic
  -l, --lines=NUMBER      put NUMBER lines per output file
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE may have a multiplier suffix:
b 512, kB 1000, K 1024, MB 1000*1000, M 1024*1024,
GB 1000*1000*1000, G 1024*1024*1024, and so on for T, P, E, Z, Y.
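
The difference between the binary and decimal multipliers can be checked on a small file. This is a sketch that assumes GNU split (the K and kB suffixes are taken from the help text above); the file name blob.bin and the prefixes bin_ and dec_ are invented for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Create a file of exactly 2048 zero bytes.
head -c 2048 /dev/zero > blob.bin

# K means 1024 bytes: 2048 / 1024 gives exactly 2 pieces.
split -b 1K blob.bin bin_

# kB means 1000 bytes: pieces of 1000, 1000 and 48 bytes, so 3 files.
split -b 1kB blob.bin dec_

ls -l bin_* dec_*
```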

answered Jan 06 '10 at 22:46 by Dave Kirby · edited Jul 16 '18 at 11:16 by TRiG

Tried georgec@ATGIS25 ~ $ split -l 100000 /cygdrive/P/2012/Job_044_DM_Radio_Propogation/Working/FinalPropogation/TRC_Longlands/trc_longlands.txt but there are no split files in the directory. Where is the output? – GeorgeC Mar 08 '12 at 04:05
It should be in the same directory. E.g. if I want to split by 1,000,000 lines per file, I do the following: `split -l 1000000 train_file train_file.` and in the same directory I'll get `train_file.aa` with the first million, then `train_file.ab` with the next million, etc. – Will Feb 08 '15 at 21:49
@GeorgeC and you can get custom output directories with the prefix: `split input my/dir/`. – Ciro Santilli OurBigBook.com Apr 24 '16 at 20:56

score 26 · Answer 4 · edited Oct 12 '22 at 13:50

To split a large text file into smaller files of 1000 lines each:

split -l 1000 <file>

To split a large binary file into smaller files of 10M each:

split -b 10M <file>

To consolidate split files into a single file:

cat x* > <file>

Split a file, each split having 10 lines (except the last split):

split -l 10 filename

Split a file into 5 files. File is split such that each split has same size (except the last split):

split -n 5 filename

Split a file with 512 bytes in each split (except the last split; use 512k for kilobytes and 512m for megabytes):

split -b 512 filename

Split a file with at most 512 bytes in each split without breaking lines:

split -C 512 filename
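
A quick way to convince yourself that splitting and consolidating are lossless is a round trip. The following is a sketch on generated data; original.txt, rejoined.txt and the chunk_ prefix are hypothetical names chosen for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate a 1000-line file and split it into 300-line pieces.
seq 1 1000 > original.txt
split -l 300 original.txt chunk_

# The shell expands chunk_* in sorted order (aa, ab, ac, ad),
# which is the order in which split wrote the pieces.
cat chunk_* > rejoined.txt

# cmp exits with status 0 only if both files are byte-for-byte identical.
cmp original.txt rejoined.txt && echo "identical"
```

Because the glob is sorted, the pieces do not need to be listed by hand as long as they share a prefix that nothing else in the directory uses.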

n files with the same number of lines appears to need `wc` unfortunately: https://stackoverflow.com/questions/3194349/how-do-i-split-a-file-into-n-no-of-parts — Ciro Santilli OurBigBook.com, Jul 23 '23 at 19:50

score 20 · Answer 5 · edited Aug 11 '21 at 23:30

Split the file "file.txt" into files of 10,000 lines each:

split -l 10000 file.txt

answered Feb 27 '18 at 09:11 by ialqwaiz · edited Aug 11 '21 at 23:30 by Peter Mortensen

score 18 · Answer 6 · edited Aug 11 '21 at 23:09

Use split:

Split a file into fixed-size pieces. It creates output files containing consecutive sections of INPUT (standard input if none is given or INPUT is `-').

Syntax: split [options] [INPUT [PREFIX]]

answered Jan 06 '10 at 22:44 by zmbush · edited Aug 11 '21 at 23:09 by Peter Mortensen

score 15 · Answer 7 · edited Aug 11 '21 at 23:20

You can also use AWK:

awk -v n=200000 '(NR-1)%n==0{if(f)close(f); f=(++c)".txt"} {print > f}' largefile
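
The AWK approach is easier to check with a tiny input. This sketch assumes a chunk size of 3 and output names piece1.txt, piece2.txt, ... (both invented for the demo); it starts a new file on the first line of every chunk and closes the previous one so that long inputs do not run out of open file descriptors.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Seven input lines, so chunks of 3 give files of 3, 3 and 1 lines.
seq 1 7 > largefile

awk -v n=3 '
  # On lines 1, 4, 7, ... close the previous file and pick the next name.
  (NR - 1) % n == 0 { if (out) close(out); out = "piece" (++count) ".txt" }
  # Every line goes to the currently selected output file.
  { print > out }
' largefile

wc -l piece*.txt
```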

answered Jan 07 '10 at 01:03 by ghostdog74 · edited Aug 11 '21 at 23:20 by Peter Mortensen

`awk -v lines=200000 -v fmt="%d.txt" '{print>sprintf(fmt,1+int((NR-1)/lines))}'` – Mark Edgar Jan 07 '10 at 06:52
with `prefix`: `awk -vc=1 'NR%200000==0{++c}{print $0 > "prefix"c".txt"}' largefile` – 7beggars_nnnnm Oct 28 '21 at 15:19

score 15 · Answer 8 · edited Apr 27 '16 at 16:39

Use:

sed -n '1,100p' filename > output.txt

Here, 1 and 100 are the line numbers which you will capture in output.txt.
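
On its own, that command extracts a single range. To cover the whole file with sed, the range has to be advanced in a loop, roughly as sketched below. The chunk size of 100 and the output_N.txt naming are assumptions for the demo; note that this rereads the input once per chunk, so split is usually the better tool for large files.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# A 250-line sample, so the loop produces files of 100, 100 and 50 lines.
seq 1 250 > filename

lines_per_file=100
total=$(wc -l < filename)
start=1
index=1
while [ "$start" -le "$total" ]; do
  end=$((start + lines_per_file - 1))
  # Print lines start..end, then quit so sed does not scan the rest.
  sed -n "${start},${end}p;${end}q" filename > "output_${index}.txt"
  start=$((end + 1))
  index=$((index + 1))
done

wc -l output_*.txt
```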

answered Apr 21 '16 at 09:27 by Harshwardhan · edited Apr 27 '16 at 16:39 by tripleee

This only obtains the first 100 lines, you need to loop it to successively split the file into the next 101..200 etc. Or just use `split` like all the top answers here already tell you. – tripleee Feb 01 '19 at 09:34
This was actually what I was looking for! – Paiman Roointan Jan 04 '23 at 10:55

Denilson Sá Maia · Answer 9 · 2018-05-31T10:40:23.377

split (from GNU coreutils, since version 8.8 from 2010-12-22) includes the following parameter:

-n, --number=CHUNKS     generate CHUNKS output files; see explanation below

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

Thus, `split -n 4 input output.` will generate four files (output.a{a,b,c,d}) of (nearly) equal size in bytes, but lines might be broken in the middle.

If we want to preserve full lines (i.e. split by lines), then this should work:

split -n l/4 input output.

Related answer: https://stackoverflow.com/a/19031247
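
The line-preserving variant can be tried on generated data. This is a sketch that assumes GNU split 8.8 or later, as stated above; the names input and output. simply mirror the example in this answer.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# Generate a 1000-line input file.
seq 1 1000 > input

# Ask for 4 output files without breaking any line in the middle.
split -n l/4 input output.

# Concatenating the pieces in order must reproduce the input exactly.
cat output.a? | cmp - input && echo "no line was lost or broken"

# Line counts per piece can differ, since the split is balanced by size.
wc -l output.a?
```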

score 12 · Answer 10 · edited Aug 11 '21 at 23:22

If you just want to split by x lines per file, the given answers about split are fine. But I am curious why no one paid attention to the requirements:

"without having to count them" -> needs wc
"having the remainder in extra file" -> split does this by default

I can't do that without counting the lines with wc, so I use this:

split -l "$(( $(wc -l < "$filename") / $chunks ))" "$filename"

This can easily be added as a function in your .bashrc file, so you can just invoke it, passing the filename and the number of chunks:

split -l "$(( $(wc -l < "$1") / $2 ))" "$1"

If you want just x chunks without the remainder in an extra file, round the division up by adding (chunks - 1) to the line count before dividing. I use this approach because I usually want x files rather than x lines per file:

split -l "$(( ($(wc -l < "$1") + $2 - 1) / $2 ))" "$1"

You can add that to a script and call it your "ninja way", because if nothing suits your needs, you can build it :-)
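
As a sketch of how such a function might look, the helper below counts the lines, rounds the lines-per-file value up, and hands it to split. The function name split_into and the "FILE." output prefix are hypothetical choices for this example; an empty input file is not handled.

```shell
# split_into FILE CHUNKS: split FILE into at most CHUNKS pieces by lines.
# (Hypothetical helper name; pieces are written as FILE.aa, FILE.ab, ...)
split_into() {
  local file=$1 chunks=$2 total per_file
  total=$(wc -l < "$file")
  # Ceiling division, so the remainder is absorbed instead of
  # spilling into an extra file.
  per_file=$(( (total + chunks - 1) / chunks ))
  split -l "$per_file" "$file" "$file."
}

# Demo in a scratch directory: 10 lines into 3 chunks gives 4 + 4 + 2.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
seq 1 10 > data.txt
split_into data.txt 3
wc -l data.txt.*
```

Rounding up guarantees no more than the requested number of files, though for some line counts it can produce fewer (for example, 9 lines into 4 chunks yields 3 files of 3 lines).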

Or, just use the `-n` option of `split`. – Amit Naidu Jun 20 '19 at 00:00

score 3 · Answer 11 · answered Dec 17 '22 at 07:21

Here is an example that divides the file "toSplit.txt" into smaller files of 200 lines each, named "splited00.txt", "splited01.txt", ..., "splited25.txt", ...

split -l 200 --numeric-suffixes --additional-suffix=".txt" toSplit.txt splited
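
With the default two-digit suffix, numeric names run out after 99, so for larger inputs it can help to widen the suffix with -a. The sketch below assumes GNU split (for --additional-suffix) and a generated 1000-line input; the three-digit width is an arbitrary choice for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files.
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# 1000 lines in pieces of 200 gives five output files.
seq 1 1000 > toSplit.txt

# -d: numeric suffixes, -a 3: three digits, so up to 1000 pieces fit.
split -l 200 -d -a 3 --additional-suffix=.txt toSplit.txt splited

# Lists splited000.txt through splited004.txt.
ls splited*
```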

answered Dec 17 '22 at 07:21 by Eric S.

This does not provide an answer to the question. Once you have sufficient [reputation](https://stackoverflow.com/help/whats-reputation) you will be able to [comment on any post](https://stackoverflow.com/help/privileges/comment); instead, [provide answers that don't require clarification from the asker](https://meta.stackexchange.com/questions/214173/why-do-i-need-50-reputation-to-comment-what-can-i-do-instead). - [From Review](/review/late-answers/33436288) – ahuemmer Dec 20 '22 at 11:07

score 2 · Answer 12 · edited Aug 11 '21 at 23:37

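Use HDFS getmerge to combine small files, then split the result into pieces of a proper size.

This method splits by bytes, so it can break lines in the middle:

split -b 125m compact.file -d -a 3 compact_prefix

Instead, I getmerge the files and split the result into pieces of about 128 MB each without breaking lines.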

# Split into pieces of about 128 MB. The size unit reported by HDFS is assumed to be M or G. Please test before use.

begainsize=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $1 }')
sizeunit=$(hdfs dfs -du -s -h "/externaldata/$table_name/$date/" | awk '{ print $2 }')
if [ "$sizeunit" = "G" ]; then
    res=$(printf "%.f" "$(echo "scale=5; $begainsize*8" | bc)")    # 1 G = 8 x 128 M
else
    res=$(printf "%.f" "$(echo "scale=5; $begainsize/128" | bc)")  # printf rounds to the nearest integer; ceiling ref: http://blog.csdn.net/naiveloafer/article/details/8783518
fi
echo "$res"
# Split into $res files with a numeric suffix. Ref: http://blog.csdn.net/microzone/article/details/52839598
compact_file_name="${compact_file}_"
echo "compact_file_name: $compact_file_name"
split -n l/"$res" "$basedir/$compact_file" -d -a 3 "$basedir/${compact_file_name}"

What is "HDFS"? [Hadoop distributed file system](https://en.wikipedia.org/wiki/Apache_Hadoop#Hadoop_distributed_file_system)? Or something else? Can you provide a reference to it? — Peter Mortensen, Aug 11 '21 at 23:29
What are "celling" and "begain"? Is the latter "begin" (or "start")? — Peter Mortensen, Aug 11 '21 at 23:42
