Split a text file using gsplit on a delimiter on OSX Mojave

Question

I have searched through many answers for hours, but none have helped me use gsplit with a delimiter. It is frustrating that there is no well-explained answer to this in 2020. So far I've tried:

First I install coreutils:

brew install coreutils

Then I run this command, which successfully splits the file every 5000 lines. However, I need it to split on a delimiter, not every 5000 lines.

gsplit -l 5000 -d --additional-suffix=.txt "$FileName" file

I can't find anything in the help output about how to split on a delimiter, any delimiter, like 'abc' for example. And there are so many answers on here that rely on some other utility (awk or gawk?) without explaining how to install it, how to get it to work, or what operating system they are using.

My file (myfile.txt), which I want to split on the 'abc' delimiter, looks like this:

myfile.txt:

randomHTML
randomHTML
randomHTML
randomHTML
abc
randomHTML
abc
randomHTML
randomJS
randomHTML
randomHTML
abc
randomHTML
randomJS
abc

There's no mention of a delimiter in the gsplit help

gsplit --help
Usage: gsplit [OPTION]... [FILE [PREFIX]]
Output pieces of FILE to PREFIXaa, PREFIXab, ...;
default size is 1000 lines, and default PREFIX is 'x'.

With no FILE, or when FILE is -, read standard input.

Mandatory arguments to long options are mandatory for short options too.
  -a, --suffix-length=N   generate suffixes of length N (default 2)
      --additional-suffix=SUFFIX  append an additional SUFFIX to file names
  -b, --bytes=SIZE        put SIZE bytes per output file
  -C, --line-bytes=SIZE   put at most SIZE bytes of records per output file
  -d                      use numeric suffixes starting at 0, not alphabetic
      --numeric-suffixes[=FROM]  same as -d, but allow setting the start value
  -x                      use hex suffixes starting at 0, not alphabetic
      --hex-suffixes[=FROM]  same as -x, but allow setting the start value
  -e, --elide-empty-files  do not generate empty output files with '-n'
      --filter=COMMAND    write to shell COMMAND; file name is $FILE
  -l, --lines=NUMBER      put NUMBER lines/records per output file
  -n, --number=CHUNKS     generate CHUNKS output files; see explanation below
  -t, --separator=SEP     use SEP instead of newline as the record separator;
                            '\0' (zero) specifies the NUL character
  -u, --unbuffered        immediately copy input to output with '-n r/...'
      --verbose           print a diagnostic just before each
                            output file is opened
      --help     display this help and exit
      --version  output version information and exit

The SIZE argument is an integer and optional unit (example: 10K is 10*1024).
Units are K,M,G,T,P,E,Z,Y (powers of 1024) or KB,MB,... (powers of 1000).
Binary prefixes can be used, too: KiB=K, MiB=M, and so on.

CHUNKS may be:
  N       split into N files based on size of input
  K/N     output Kth of N to stdout
  l/N     split into N files without splitting lines/records
  l/K/N   output Kth of N to stdout without splitting lines/records
  r/N     like 'l' but use round robin distribution
  r/K/N   likewise but only output Kth of N to stdout

GNU coreutils online help: <https://www.gnu.org/software/coreutils/>
Full documentation <https://www.gnu.org/software/coreutils/split>
or available locally via: info '(coreutils) split invocation'

Sounds like you are looking for `csplit`, which is probably installed as `gcsplit` in your environment, if you want the GNU version instead of the BSD version which is available out of the box. – tripleee, Jul 14 '20 at 15:33
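
To illustrate the `csplit` suggestion in the comment above, here is a minimal sketch, not a tested recipe for every setup. It assumes the Homebrew coreutils package from the question is installed (it provides GNU csplit as `gcsplit`), and that the delimiter sits alone on its own line, as in the sample file. The scratch directory, the shortened sample data, and the `part-` output prefix are made up for illustration.

```shell
# Work in a scratch directory so the sample files do not clutter anything.
cd "$(mktemp -d)" || exit 1

# Small stand-in for myfile.txt from the question (hypothetical data).
printf '%s\n' randomHTML randomHTML abc randomHTML abc randomHTML randomJS abc > myfile.txt

# Prefer the gcsplit from Homebrew; fall back to csplit where that is already GNU (e.g. Linux).
CSPLIT=$(command -v gcsplit || command -v csplit)

# Split at every line that is exactly "abc":
#   --suppress-matched  leave the delimiter lines out of the pieces
#   -z                  do not keep empty pieces
#   -s                  do not print the size of each piece
#   -f / -b             name the pieces part-00.txt, part-01.txt, ...
#   '{*}'               repeat the pattern as many times as possible
"$CSPLIT" --suppress-matched -z -s -f part- -b '%02d.txt' myfile.txt '/^abc$/' '{*}'

ls part-*.txt
```

The BSD `csplit` that ships with macOS does not accept `--suppress-matched`, `-z`, or `-b`, which is why the sketch looks for `gcsplit` first.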

score 0 · Accepted Answer · answered Jul 14 '20 at 15:23

How about:

awk -F\(abc\) 'RS="^$" { for (i=1;i<NF;i++) { system("echo \""$i"\" > "i"-abc.txt") } }' abc.txt

We set the record separator to a pattern that never matches, so the whole file is processed as one record. We set "abc" as the field delimiter, then loop through each field and use the system command to echo that field out to a file named N-abc.txt, where N is the number of the field.

Here abc.txt holds the original data.

Raman Sailopal

12,320
2
11
18

Please note that the system function within awk spawns a separate process on each call, so the bigger the file, the more processes are spawned. – Raman Sailopal Jul 14 '20 at 15:27
Why would you use `system("echo")` over the built-in `print` anyway? – tripleee Jul 14 '20 at 15:35
It works at least, but only for the last 10 occurrences of the abc delimiter. – AGrush Jul 14 '20 at 15:44
You may need to add an empty line to the start of the file. – Raman Sailopal Jul 14 '20 at 15:56
Ah, I think I'm getting loads of errors because the file is full of HTML content. I should have mentioned that the file is full of HTML and JS. – AGrush Jul 14 '20 at 16:01
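
Following up on the `print` question in the comments: `system("echo ...")` hands each chunk to a shell, so quotes, backticks, and `$` in HTML or JS are likely to be interpreted there, which is a plausible cause of the errors mentioned above. The sketch below avoids the shell by using awk's built-in `print` with output redirection. It is an illustration rather than the answer's own method, and it assumes the delimiter is a line consisting only of `abc`. The sample data and the `piece-` file names are hypothetical. It uses only POSIX awk features, so the stock macOS awk should be enough.

```shell
# Work in a scratch directory so the sample files do not clutter anything.
cd "$(mktemp -d)" || exit 1

# Hypothetical sample with characters that a shell would normally interpret.
printf '%s\n' '<p>"quoted" $HOME</p>' abc '<script>var x = `y`;</script>' '<div>' abc > myfile.txt

awk '
  # A line that is exactly "abc" ends the current piece and is not copied.
  $0 == "abc" { close(out); n++; next }
  # Every other line goes to the current piece: piece-00.txt, piece-01.txt, ...
  { out = sprintf("piece-%02d.txt", n); print > out }
' myfile.txt

ls piece-*.txt
```

Because awk writes the files itself, no extra process is spawned per chunk, and closing each piece when its delimiter is reached keeps the number of open files at one.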
