3

I have a text file (A.in) and I want to split it into multiple files. The split should occur every time an empty line is found. The filenames should be numbered sequentially (A1.in, A2.in, ...).

I found this answer that suggests using awk, but I can't make it work with my desired naming convention:

awk -v RS="" '{print $0 > $1".txt"}' file

I also found other answers telling me to use the command csplit -l, but I can't make it match empty lines. I tried matching the pattern '', but I am not that familiar with regex and I get the following error:

bash-3.2$ csplit A.in ""
csplit: : unrecognised pattern

Input file:

A.in

4 
RURDDD

6
RRULDD
KKKKKK

26
RRRULU

Desired output:

A1.in

4 
RURDDD

A2.in

6
RRULDD
KKKKKK

A3.in

26
RRRULU
kvantour
  • I am trying to improve my questions quality to match SOF standards, any feedback on the question would be highly appreciated – Alessandro Solbiati Dec 20 '18 at 09:49
  • 2
    This is a good question and close to SOF standards. The only thing missing is stating what goes wrong with your attempt. – kvantour Dec 20 '18 at 10:05
  • Right, 2 major things to avoid saying in a question on SO are "it doesn't work" (without an explanation of in what way "it doesn't work") and "I want a one-liner..." (because that implies you favor brevity over all of the things that **really** matter for software like coupling, cohesion, efficiency, portability, clarity, etc. and so would reject a good answer in favor of a brief answer). – Ed Morton Dec 20 '18 at 14:35

3 Answers

4

Another fix for the awk:

$ awk -v RS="" '{
    split(FILENAME,a,".")  # separate name and extension
    f=a[1] NR "." a[2]     # form the filename, use NR as number
    print > f              # output to file
    close(f)               # close it: with MANY output files awk could run out of fds
}' A.in
James Brown
  • what does NR stand for? – Alessandro Solbiati Dec 20 '18 at 10:03
  • 2
    Record number (well, _number of input records awk has processed since the beginning of the program's execution_, according to the GNU awk documentation). `RS=""` separates the records on empty lines and `NR` is awk's built-in variable for the count.
  • 1
    @AlessandroSolbiati note that the solution of James is a bit more robust than mine. This deserves the accepted answer! – kvantour Dec 20 '18 at 10:06
  • 1
    @JamesBrown Consider it an early Christmas gift ;-) – kvantour Dec 20 '18 at 10:43
  • @kvantour You need to [start voting](https://winterbash2018.stackexchange.com/gonna-find-out), Santa (to get that hat). – James Brown Dec 20 '18 at 11:42
  • 2
    Just be aware the split() approach will fail if/when FILENAME contains more than one `.`. If that's a possibility then `base=sfx=FILENAME; sub(/\.[^.]+$/,"",base); sub(/.*\./,"",sfx); f=base NR "." sfx`. – Ed Morton Dec 20 '18 at 14:44
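
The suggestion in the last comment can be sketched as a complete command. This is only an illustration: the scratch directory and the file name `my.data.in` are made up to show a name with more than one dot, and the sample content is the input from the question.

```shell
# Work in a scratch directory so the demo does not touch real files (assumption).
cd "$(mktemp -d)" || exit 1

# Hypothetical input whose name contains two dots.
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > my.data.in

awk -v RS="" '
FNR == 1 {                      # derive base name and suffix once per input file
    base = sfx = FILENAME
    sub(/\.[^.]+$/, "", base)   # drop the last ".suffix"   -> "my.data"
    sub(/.*\./, "", sfx)        # keep text after last dot  -> "in"
}
{
    f = base NR "." sfx         # my.data1.in, my.data2.in, ...
    print > f
    close(f)                    # release the descriptor before the next record
}' my.data.in
```

With `split(FILENAME,a,".")` the same input would give `a[1]="my"` and `a[2]="data"`, so the output files would be named my1.data, my2.data, and so on.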
2

In any normal case, the following script should work:

awk 'BEGIN{RS=""}{ print > ("A" NR ".in") }' file

If this fails, the most likely cause is CRLF line terminators in the input file (See here and here).
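
To illustrate the CRLF case: when lines end in CRLF, the "empty" lines still contain a carriage return, so `RS=""` does not treat them as record separators. One possible workaround, assuming carriage returns only occur as part of line endings, is to delete them with `tr` before awk sees the data. The scratch directory and the file name `A_crlf.in` below are made up for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files (assumption).
cd "$(mktemp -d)" || exit 1

# Sample input with Windows-style CRLF line endings (hypothetical file name).
printf '4\r\nRURDDD\r\n\r\n6\r\nRRULDD\r\nKKKKKK\r\n\r\n26\r\nRRRULU\r\n' > A_crlf.in

# Delete every carriage return, then split on the now truly empty lines.
tr -d '\r' < A_crlf.in |
    awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }'
```

Because awk reads from the pipe here, the output names are hard-coded as "A" NR ".in" rather than derived from FILENAME.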

As mentioned by James, this can be made a bit more robust by closing each output file after writing to it:

awk 'BEGIN{RS=""}{ f = "A" NR ".in"; print > f; close(f) }' file

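If you want to use csplit (GNU version), the following will do the trick. Note that csplit numbers its output from zero, so with this suffix format the files are named A00.in, A01.in, ...: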

csplit --suppress-matched  -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'

See man csplit for details on the options used above.
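
To get the A1.in, A2.in, ... names from the question, the csplit output can be renamed afterwards. The loop below is only a sketch: it assumes GNU csplit (for `--suppress-matched`), fewer than 100 pieces (so every csplit name has exactly two digits), and a scratch directory created just for the demo.

```shell
# Work in a scratch directory so the demo does not touch real files (assumption).
cd "$(mktemp -d)" || exit 1
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > A.in

# GNU csplit: writes A00.in, A01.in, A02.in and drops the empty lines.
csplit --suppress-matched -f "A" -b "%0.2d.in" A.in '/^$/' '{*}'

# Rename to 1-based names; the glob expands in sorted order before the loop runs.
n=1
for f in A[0-9][0-9].in; do
    mv "$f" "A$n.in"
    n=$((n + 1))
done
```

The pattern `A[0-9][0-9].in` matches neither the original A.in nor the renamed single-digit files, so the loop only touches what csplit created.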

kvantour
  • works as intended. guess I was missing the syntax of BEGIN and c++ for the naming convention. Thank you very much – Alessandro Solbiati Dec 20 '18 at 10:02
  • 1
    @AlessandroSolbiati, most likely you missed the brackets for the redirection. Some versions of awk handle this differently if the brackets are missing. – kvantour Dec 20 '18 at 10:04
  • 1
    @AlessandroSolbiati I have added the `csplit` solution to the problem too ;-) – kvantour Dec 20 '18 at 10:20
0

Input file content:

$ cat A.in 
4 
RURDDD

6
RRULDD
KKKKKK

26
RRRULU

AWK file content:

BEGIN{
    n=1
}
{
    if(NF!=0){
        print $0 >> "A"n".in"
    }else{
        n++
    }
}

Execution:

awk -f ctrl.awk A.in

Output:

$ cat A1.in 
4 
RURDDD

$ cat A2.in 
6
RRULDD
KKKKKK

$ cat A3.in 
26
RRRULU

PS: One-liner execution without AWK file:

awk 'BEGIN{n=1}{if(NF!=0){print $0 >> "A"n".in"}else{n++}}' A.in
downtheroad
  • That will fail with a syntax error in some awks due to unparenthesized right side of redirection and it will fail with a "too many open files" error in most awks due to not closing the output files as you go. – Ed Morton Dec 20 '18 at 14:43
  • @EdMorton I've tested the code with awk, gawk, mawk and nawk without errors – downtheroad Dec 21 '18 at 12:53
  • idk which awk you mean when you include just `awk` at the start of that list but there's no way `nawk` at least wouldn't fail when you get past about 20 output file names and here's one example of the syntax error using BSD awk version 20070501 on MacOS: `awk: syntax error at source line 1 context is BEGIN{n=1}{if(NF!=0){print $0 >> >>> "A"n <<< ".in"}else{n++}} awk: illegal statement at source line 1` – Ed Morton Dec 21 '18 at 14:34
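
The two concerns raised in these comments (the unparenthesized redirection target and the files that are never closed) could be addressed along the following lines. This is a sketch rather than the answerer's code; the scratch directory is an assumption for the demo and the sample data is the input from the question.

```shell
# Work in a scratch directory so the demo does not touch real files (assumption).
cd "$(mktemp -d)" || exit 1
printf '4\nRURDDD\n\n6\nRRULDD\nKKKKKK\n\n26\nRRRULU\n' > A.in

awk '
BEGIN { n = 1 }
NF == 0 {                    # empty line: finish the current output file
    close("A" n ".in")       # free the descriptor before moving on
    n++
    next
}
{ print >> ("A" n ".in") }   # parenthesized target avoids the parsing ambiguity
' A.in
```

As in the original, `>>` appends, so running it twice in the same directory duplicates the content of each output file.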