
I am trying to split a big text file after every n empty lines. The file uses exactly one empty line as the separator between blocks of data, like below:

Lorem ipsum
Lorem ipsum
Lorem ipsum

Lorem ipsum
Lorem ipsum

Lorem ipsum

Lorem ipsum
Lorem ipsum

Lorem
Lorem

...

I have tried to use csplit:

csplit data.txt /^$/ {3}

My expectation is that after 3 empty lines (not consecutive ones; I mean after the cursor has passed 3 empty lines) it splits the file, and then keeps doing so for the rest of the input. But it actually splits the file at each empty line.

My expected files:

xx00

Lorem ipsum
Lorem ipsum
Lorem ipsum

Lorem ipsum
Lorem ipsum

Lorem ipsum

xx01

Lorem ipsum
Lorem ipsum

Lorem
Lorem

Any suggestions?

gmtek
  • The problem you are having is that a regex applies to a single LINE of data, not to multiple lines, so the repetition `{3}` doesn't do what you want it to do. Another option is `awk` (or a bash script, though awk will be faster). In either case you can use internal variables to keep count of the empty lines encountered. – David C. Rankin Jun 08 '22 at 07:15
  • _not consecutive, but after cursor processes 3 empty lines_ But is it possible that there are consecutive empty lines? – James Brown Jun 08 '22 at 07:19
  • Also, the output you show is inconsistent with a split at the 3rd newline. In that case `xx00` should not have the final 2 lines you show. `xx00` shows splitting the line on the 4th newline, which would remove the first two lines in `xx01`. – David C. Rankin Jun 08 '22 at 07:21
  • @DavidC.Rankin corrected the output. – gmtek Jun 08 '22 at 07:28
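
To make the first comment concrete, here is a small sketch of what the repeat count does in csplit. The sample data (five blocks separated by four empty lines) and the temporary directory are made up for illustration; they are not the asker's real file.

```shell
# Hypothetical sample data: five blocks separated by four empty lines.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a1\na2\na3\n\nb1\nb2\n\nc1\n\nd1\nd2\n\ne1\ne2\n' > data.txt

# /^$/ splits just before the next empty line; {3} repeats that same split
# three more times. It does not mean "split at every third empty line".
csplit -s data.txt '/^$/' '{3}'

ls xx*    # one piece per block
```

So the count says how many times the pattern is applied, which matches the behaviour described in the question: one split per empty line.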

4 Answers


With awk (tested with GNU and BSD awk):

awk -v max=3 '{print > sprintf("xx%02d", int(n/max))} /^$/ {n += 1}' file
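
For readers who want to see how the one-liner above works, here is the same logic spread over several lines with comments. This is only a sketch: the sample input and the temporary directory are invented for the demonstration, and the variable `out` is a name introduced here for readability.

```shell
# Hypothetical sample input: five blocks, each followed by one empty line.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a1\na2\na3\n\nb1\nb2\n\nc1\n\nd1\nd2\n\ne1\ne2\n\n' > file

awk -v max=3 '
{
    # n empty lines seen so far -> file index n/max, rounded down
    out = sprintf("xx%02d", int(n / max))
    print > out
}
# Count the empty line only after it has been written, so the separator
# stays at the end of the chunk it closes.
/^$/ { n += 1 }
' file

wc -l xx00 xx01
```

The order of the two rules matters: because the counter is incremented after the print, the third empty line still lands in the first output file.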
Renaud Pacalet

This awk should also work with an empty RS:

awk -v n=3 -v RS= '{ORS=RT; print > sprintf("xx%02d", int((NR-1)/n))}' file
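
Note that `RT` is a GNU awk extension, so the command above needs gawk. As a hedged alternative, the sketch below keeps the paragraph-mode idea (`RS` empty) but writes a fixed separator through `ORS` instead. It assumes that it is acceptable for every block in the output to be followed by exactly one empty line, which also means runs of several blank lines in the input are collapsed. The sample input and the variable `out` are made up for the demonstration.

```shell
# Hypothetical sample input: five blocks separated by single empty lines.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a1\na2\na3\n\nb1\nb2\n\nc1\n\nd1\nd2\n\ne1\ne2\n' > file

# RS= turns on paragraph mode: each blank-line-separated block is one record.
# ORS is set to two newlines so every block is written back followed by
# one empty line.
awk -v n=3 -v RS= -v ORS='\n\n' '
{
    out = sprintf("xx%02d", int((NR - 1) / n))   # n blocks per output file
    print > out
}' file

wc -l xx00 xx01
```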
anubhava

awk is good for this.

Split every n empty lines, naming files with:

No leading zeroes:

awk -v n=3 '
# start at 0 so the first file is xx0 rather than plain xx
BEGIN {f = 0}
$0 == "" {++c}
c <= n {print > ("xx" f)}
c==n {c=0; ++f}' file

With a minimum width (zero-padded):

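awk -v n=3 -v width=2 '
# pad the initial index too, so the first file is xx00 rather than plain xx
BEGIN {f = sprintf("%0*d",width,0)}
$0 == "" {++c}
c <= n {print > ("xx" f)}
c==n {c=0; ++f; f = sprintf("%0*d",width,f)}' file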

To remove the trailing empty line in each file, just change c <= n to c < n.
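
As a quick illustration of that change, here is a sketch of the `c < n` variant run on made-up sample data. It sets `f = 0` up front, which is an addition here so that the first file is named xx0; the input file and temporary directory are hypothetical.

```shell
# Hypothetical sample input: five blocks, each followed by one empty line.
tmp=$(mktemp -d) && cd "$tmp" || exit 1
printf 'a1\na2\na3\n\nb1\nb2\n\nc1\n\nd1\nd2\n\ne1\ne2\n\n' > file

awk -v n=3 '
BEGIN { f = 0 }                 # first output file is xx0
$0 == "" { ++c }                # count empty lines in the current chunk
c < n { print > ("xx" f) }      # the n-th empty line is not written
c == n { c = 0; ++f }           # start the next chunk
' file

tail -n 1 xx0    # last line of the first chunk is data, not an empty line
```

Only chunks that are closed by a full n-th empty line lose their trailing separator; a shorter final chunk is written as it appears in the input.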

dan
removed './xx00'
removed './xx01'
removed './awkprof.out'

    {m,g}awk '{
        print >> sprintf("xx%0*.f%.*s", __-(_~_),
                 int(_/__),_<_,_+=!NF) }' FS='^$' __=3

-rw-r--r--  1 501  75 Jun  8 09:19:10 2022 xx00
-rw-r--r--  1 501  37 Jun  8 09:19:10 2022 xx01


../../Desktop/testdiremptylines/

     1  Lorem ipsum
     2  Lorem ipsum
     3  Lorem ipsum
     4  
     5  Lorem ipsum
     6  Lorem ipsum
     7  
     8  Lorem ipsum
     9  

 xx00

     1  Lorem ipsum
     2  Lorem ipsum
     3  
     4  Lorem
     5  Lorem

 xx01
RARE Kpop Manifesto