Bash how to split file on empty line with awk

Question

I have a text file (A.in) and I want to split it into multiple files. The split should occur every time an empty line is found. The filenames should be numbered sequentially (A1.in, A2.in, ...).

I found this answer that suggests using awk, but I can't make it work with my desired naming convention:

awk -v RS="" '{print $0 > $1".txt"}' file

I also found other answers telling me to use the command csplit -l, but I can't make it match empty lines. I tried matching the pattern '', but I am not that familiar with regex and I get the following:

bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern

Input file:

A.in

4 
RURDDD

6
RRULDD
KKKKKK

26
RRRULU

Desired output:

A1.in

4 
RURDDD

A2.in

6
RRULDD
KKKKKK

A3.in

26
RRRULU

I am trying to improve my questions quality to match SOF standards, any feedback on the question would be highly appreciated — Alessandro Solbiati, Dec 20 '18 at 09:49
This is a good question and close to SOF standards. The only thing missing is stating what goes wrong with your attempt. — kvantour, Dec 20 '18 at 10:05
Right, 2 major things to avoid saying in a question on SO are "it doesn't work" (without an explanation of in what way "it doesn't work") and "I want a one-liner..." (because that implies you favor brevity over all of the things that **really** matter for software like coupling, cohesion, efficiency, portability, clarity, etc. and so would reject a good answer in favor of a brief answer). — Ed Morton, Dec 20 '18 at 14:35

score 4 · Accepted Answer · answered Dec 20 '18 at 10:01

Another fix for the awk:

$ awk -v RS="" '{
    split(FILENAME,a,".")  # separate name and extension
    f=a[1] NR "." a[2]     # form the filename, use NR as number
    print > f              # output to file
    close(f)               # in case there are MANY files, to avoid running out of fds
}' A.in

James Brown


what does NR stand for? – Alessandro Solbiati Dec 20 '18 at 10:03

Record number (well, _number of input records awk has processed since the beginning of the program's execution_, per the GNU awk documentation). `RS=""` separates the records on empty lines and `NR` is awk's built-in variable for the count. – James Brown Dec 20 '18 at 10:05
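
To see what `NR` counts when `RS=""`, here is a minimal sketch that feeds the question's sample data to awk through `printf`. The `out` variable and the wording of the printed message are only for illustration and are not part of the answer.

```shell
# Paragraph mode (RS=""): each block separated by blank lines is one record,
# and NR is the running count of records read so far.
out=$(printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' |
      awk -v RS="" '{ print "record " NR " starts with " $1 }')
printf '%s\n' "$out"
```

With this input the loop body runs three times, once per block, which is why `NR` can serve directly as the file number.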

@AlessandroSolbiati note that the solution of James is a bit more robust than mine. This deserves the accepted answer! – kvantour Dec 20 '18 at 10:06

@JamesBrown Consider it an early Christmas gift ;-) – kvantour Dec 20 '18 at 10:43
@kvantour You need to [start voting](https://winterbash2018.stackexchange.com/gonna-find-out), Santa (to get that hat). – James Brown Dec 20 '18 at 11:42

Just be aware the split() approach will fail if/when FILENAME contains more than one `.`. If that's a possibility then `base=sfx=FILENAME; sub(/\.[^.]+$/,"",base); sub(/.*\./,"",sfx); f=base NR "." sfx`. – Ed Morton Dec 20 '18 at 14:44
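
A sketch of how the `sub()` variant from the comment above could be dropped into the accepted answer. The scratch directory and the input name `my.data.in` are hypothetical and chosen only to show a name with more than one dot. Computing `base` and `sfx` once at `NR == 1` assumes a single input file.

```shell
# Work in a scratch directory so the demo does not touch real files.
cd "$(mktemp -d)" || exit 1
# Hypothetical input file whose name contains two dots.
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > my.data.in

awk -v RS="" '
NR == 1 {                          # compute the name parts once
    base = sfx = FILENAME
    sub(/\.[^.]+$/, "", base)      # strip the last ".ext"        -> my.data
    sub(/.*\./, "", sfx)           # keep text after the last dot -> in
}
{
    f = base NR "." sfx            # my.data1.in, my.data2.in, ...
    print > f
    close(f)                       # avoid running out of file descriptors
}' my.data.in

ls
```

With `split(FILENAME,a,".")` the same input would produce names built from `my` and `data`, losing the real extension.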

kvantour · Answer 2 · 2018-12-20T10:20:09.050

2

In any normal case, the following script should work:

awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file

If this fails, the most likely cause is CRLF line terminations in the input (see here and here).
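
To illustrate the CRLF case: with Windows line endings the "empty" lines still contain a carriage return, so `RS=""` finds no blank lines and the whole file ends up as a single record. One possible workaround, sketched here under the assumption that deleting every carriage return is acceptable for your data, is to filter the input through `tr` first. The scratch directory and sample data are only for the demo.

```shell
cd "$(mktemp -d)" || exit 1
# Sample input with CRLF line endings.
printf '4\r\nRURDDD\r\n\r\n6\r\nRRULDD\r\nKKKKKK\r\n\r\n26\r\nRRRULU\r\n' > A.in

# Delete the carriage returns, then split in paragraph mode as before.
tr -d '\r' < A.in |
awk -v RS="" '{ f = "A" NR ".in"; print > f; close(f) }'

ls
```

The resulting files have plain LF line endings.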

As James mentioned, it can be made a bit more robust by closing each file after writing it:

awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file

If you want to use csplit, the following will do the trick:

csplit --suppress-matched  -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'

See man csplit for details on the options used above.

edited Dec 20 '18 at 10:20

answered Dec 20 '18 at 09:59

kvantour


works as intended. guess I was missing the syntax of BEGIN and c++ for the naming convention. Thank you very much – Alessandro Solbiati Dec 20 '18 at 10:02

@AlessandroSolbiati, most likely you missed the brackets for the redirection. Some versions of awk handle this differently if the brackets are missing. – kvantour Dec 20 '18 at 10:04

@AlessandroSolbiati I have added the `csplit` solution to the problem too ;-) – kvantour Dec 20 '18 at 10:20

score 0 · Answer 3 · answered Dec 20 '18 at 14:05

Input file content:

$ cat A.in 
4 
RURDDD

6
RRULDD
KKKKKK

26
RRRULU

AWK file content:

BEGIN{
    n=1
}
{
    if(NF!=0){
        print $0 >> "A"n".in"
    }else{
        n++
    }
}

Execution:

awk -f ctrl.awk A.in

Output:

$ cat A1.in 
4 
RURDDD

$ cat A2.in 
6
RRULDD
KKKKKK

$ cat A3.in 
26
RRRULU

PS: One-liner execution without AWK file:

awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in

downtheroad


That will fail with a syntax error in some awks due to unparenthesized right side of redirection and it will fail with a "too many open files" error in most awks due to not closing the output files as you go. – Ed Morton Dec 20 '18 at 14:43
@EdMorton I've tested the code with awk, gawk, mawk and nawk without errors – downtheroad Dec 21 '18 at 12:53
idk which awk you mean when you include just `awk` at the start of that list but there's no way `nawk` at least wouldn't fail when you get past about 20 output file names and here's one example of the syntax error using BSD awk version 20070501 on MacOS: `awk: syntax error at source line 1 context is BEGIN{n=1}{if(NF!=0){print $0 >> >>> "A"n <<< ".in"}else{n++}} awk: illegal statement at source line 1` – Ed Morton Dec 21 '18 at 14:34
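
A sketch of how this line-by-line approach could be adjusted to address both points raised in the comments. The file name is built in a variable `f`, so the redirection target is a single variable rather than an unparenthesized concatenation, and each file is closed when a blank line ends its block. Skipping runs of consecutive blank lines without advancing the counter is an addition of this sketch, not something the original answer does. The scratch directory and sample data are for the demo only.

```shell
cd "$(mktemp -d)" || exit 1
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > A.in

awk '
BEGIN { n = 1 }
NF == 0 {                          # blank line: finish the current file
    if (f != "") { close(f); f = ""; n++ }
    next
}
{
    if (f == "") f = "A" n ".in"   # first line of a new block
    print > f                      # target is a plain variable
}' A.in

ls
```

Because each output file is opened once and never reopened, `>` is enough here. `>>` would only matter if you wanted to append to files left over from an earlier run.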
