
I have a file, say all, with 2000 lines, and I would like to split it into 4 smaller files containing lines 1~500, 501~1000, 1001~1500, and 1501~2000.

Perhaps I can do this using:

cat all | head -n 500 >small1
cat all | tail -n 1500 | head -n 500 >small2
cat all | tail -n 1000 | head -n 500 >small3
cat all | tail -n 500 >small4

But this way involves calculating line numbers by hand, which is error-prone when the number of lines does not divide evenly, or when we want to split the file into many small files (e.g. a file all with 3241 lines that we want to split into 7 files of 463 lines each).
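
For instance, automating that bookkeeping ends up looking something like this (just a rough sketch; the rounding arithmetic and variable names are only illustrative):

# count the lines and round up to get the lines per piece
total=$(wc -l < all)
per=$(( (total + 3) / 4 ))
# extract each piece starting from its computed first line;
# the last piece simply gets whatever lines remain
for i in 1 2 3 4; do
  start=$(( (i - 1) * per + 1 ))
  tail -n +"$start" all | head -n "$per" > "small$i"
done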

Is there a better way to do this?

Yishu Fang

3 Answers


When you want to split a file, use split:

split -l 500 all all

will split the file into several files that each have 500 lines. If you want to split the file into 4 files of roughly the same size, use something like:

split -l $(( $( wc -l < all ) / 4 + 1 )) all all
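
For instance, with the 2000-line file from the question (a quick check; the alla* names come from GNU split's default two-letter suffixes appended to the given prefix):

$ seq 2000 > all
$ split -l 500 all all
$ wc -l alla*
 500 allaa
 500 allab
 500 allac
 500 allad
2000 total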
William Pursell
  • It is interesting to note that always obtaining the same number of files with "split -l" seems impossible, even if there are enough lines in the original set. Consider for example a set of 5 lines to split into 4 files. Asking to place 1 line per file will produce 5 files, not 4. Asking to place 2 lines per file will produce 3 files, not 4. "split -n" might be the correct way of doing that, if your version of "split" has it. – Sébastien Piérard Apr 26 '15 at 14:20
  • According to `man split` you can also do `split -n l/4` if you want 4 files of roughly similar size "without splitting lines/records" – bluu Dec 21 '21 at 11:30

Look into the split command; it should do what you want (and more):

$ split --help
Usage: split [OPTION]... [INPUT [PREFIX]]
Output fixed-size pieces of INPUT to PREFIXaa, PREFIXab, ...; default
size is 1000 lines, and default PREFIX is 'x'.  With no INPUT, or when INPUT
is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names.
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of lines per output file
  -d, --numeric-suffixes[=FROM]  use numeric suffixes instead of alphabetic.
                                   FROM changes the start value (default 0).
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines per output file
  -n, --number=CHUNKS     generate CHUNKS output files.  See below
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

SIZE is an integer and optional unit (example: 10M is 10*1024*1024).  Units
are K, M, G, T, P, E, Z, Y (powers of 1024) or KB, MB, ... (powers of 1000).

CHUNKS may be:
N       split into N files based on size of input
K/N     output Kth of N to stdout
l/N     split into N files without splitting lines
l/K/N   output Kth of N to stdout without splitting lines
r/N     like 'l' but use round robin distribution
r/K/N   likewise but only output Kth of N to stdout
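
For example, to get numbered output files instead of the default alphabetic suffixes, something like this should work (assuming GNU split; the part_ prefix is only an example):

$ split -d -l 500 all part_
$ ls part_*
part_00  part_01  part_02  part_03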
John Brodie

As the others have already mentioned, you could use split. The complicated command substitution mentioned in the accepted answer is not necessary. For reference, I'm adding the following commands, which accomplish almost what has been requested. Note that when the -n command-line argument is used to specify the number of chunks, the small* files do not contain exactly 500 lines.

$ seq 2000 > all
$ split -n l/4 --numeric-suffixes=1 --suffix-length=1 all small
$ wc -l small*
 583 small1
 528 small2
 445 small3
 444 small4
2000 total

Alternatively, you could use GNU parallel:

$ < all parallel -N500 --pipe --cat cp {} small{#}
$ wc -l small*
 500 small1
 500 small2
 500 small3
 500 small4
2000 total

As you can see, this incantation is quite complex. GNU Parallel is most often used for parallelizing pipelines, and IMHO it is a tool worth looking into.
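
For instance (just an illustration, assuming GNU parallel is installed), you could then compress the resulting pieces in parallel:

$ parallel gzip ::: small*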