
In the GNU Awk manual's section 4.1.2 Record Splitting with gawk we can read:

When RS is a single character, RT contains the same single character. However, when RS is a regular expression, RT contains the actual input text that matched the regular expression.

This variable RT is very useful in some cases.

Similarly, we can set a regular expression as the field separator. For example, here we allow it to be either ";" or "|":

$ gawk -F';' '{print NF}' <<< "hello;how|are you"
2  # there are 2 fields, since ";" appears once
$ gawk -F'[;|]' '{print NF}' <<< "hello;how|are you"
3  # there are 3 fields, since ";" appears once and "|" also once

However, if we want to pack the data back together, we have no way to know which separator appeared between two given fields. So if, in the previous example, I want to loop through the fields and print them joined again using FS, it prints the whole expression every time:

$ gawk -F'[;|]' '{for (i=1;i<=NF;i++) printf ("%s%s", $i, FS)}' <<< "hello;how|are you"
hello[;|]how[;|]are you[;|]  # a literal "[;|]" shows in the place of FS

Is there a way to "repack" the fields using the specific field separator used to split each one of them, similarly to what RT would allow to do?

(the examples given in the question are rather simple, but just to show the point)


3 Answers


Is there a way to "repack" the fields using the specific field separator used to split each one of them

Use gnu-awk's split(), which takes an extra 4th parameter that receives the delimiters matched by the supplied regex:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {for (i=1; i in seps; i++) printf "%s%s", flds[i], seps[i]; print flds[i]}' <<< "$s"

hello;how|are you

A more readable version:

s="hello;how|are you"
awk 'split($0, flds, /[;|]/, seps) {
   for (i=1; i in seps; i++)
      printf "%s%s", flds[i], seps[i]
   print flds[i]
}' <<< "$s"

Take note of the 4th seps parameter in split, which stores an array of the text matched by the regular expression used as the 3rd parameter, i.e. /[;|]/.

Of course it is not as short & simple as RS, ORS and RT, which can be written as:

awk -v RS='[;|]' '{ORS = RT} 1' <<< "$s"
  • 1
    Or If you only want the separators, following @anubhava's code: `awk '{n=split($0,s,/[;|]/,seps); for(i=1;i – Carlos Pascual Jan 04 '21 at 10:04
  • 2
    The other thing you should probably change (or discuss) is the loop range - the `seps` array can start at 0 when using the default FS because seps[0] then holds the white space that occurs before flds[1] and is normally discarded during field splitting. – Ed Morton Jan 04 '21 at 14:53
  • ok so I am trying to understand more. Would `seps[0]` ever have any value for something like: `split($0, flds, /[[:blank:]]+/, seps)` – anubhava Jan 04 '21 at 15:00
  • 1
    No, only for `split($0, flds, " ", seps)` because the FS value of `" "` is actually a metacharacter for field splitting. I'll post a POSIX equivalent of what gawk split() does to help clarify - one sec... – Ed Morton Jan 04 '21 at 15:36
  • 1
    Done, see https://stackoverflow.com/a/65565440/1745001 – Ed Morton Jan 04 '21 at 15:42
  • Interesting so `split($0, flds, " ", seps)` behaves differently in comparison with `split($0, flds, / /, seps)` – anubhava Jan 04 '21 at 15:50
  • 1
    Yes because the constant regexp `/ /` isn't the character `" "`. `"[ ]"` also behaves differently from `" "` - it's the portable way to split on a single blank char. – Ed Morton Jan 04 '21 at 16:00
  • 3
    It's may be worth mentioning here that you can't pass a constant regexp to a user-defined function so while you can do `split($0,arr,/re/)` you can't write your own function `foo()` and do `foo($0,arr,/re/)`, you have to call it as `foo($0,arr,"re")` instead using a dynamic regexp this time because `/re/` in that context means `($0 ~ /re/ ? 1 : 0)`. GNU awk has an enhancement called strongly typed regexps that works around that problem by prefixing the constant regexp with `@`, e.g. `foo($0,arr,@/re/)` - see https://www.gnu.org/software/gawk/manual/gawk.html#Strong-Regexp-Constants – Ed Morton Jan 04 '21 at 16:10
  • 2
    Wow, that's good to know. I never knew about `@/re/` – anubhava Jan 04 '21 at 16:24
  • 1
    Wow, such an interesting debate you had here. Note @EdMorton that Strongly Typed Regexp are not allowed in split, as indicated in the link you provide: `gawk -v patt=@/;/ '{print split($0, a, patt)}' <<< "ha;he;hi"` fails, for example. – fedorqui Jan 07 '21 at 10:15
  • 1
    Worth mentioning the usage of these in strongly typed regexp given by Ed in [How to check the type of an awk variable?](https://stackoverflow.com/a/46667839/1983854) – fedorqui Jan 07 '21 at 10:16
  • 1
    Would it be interesting to create a canonical post about Strongly typed regexp constants? May be useful to make more people known about them – fedorqui Jan 07 '21 at 10:26
  • 2
    @fedorqui'SOstopharming' they are allowed as an arg to `split()`; you just forgot to quote the string you used to init your awk variable, so the `;` in the middle of it terminated the command line. Try `gawk -v patt='@/;/' '{print split($0, a, patt)}' <<< "ha;he;hi"` instead. **ALWAYS** quote strings and scripts in the shell unless you have a specific reason why you **need** to not do so. – Ed Morton Jan 07 '21 at 17:05
  • 2
    Ooops you are right, @EdMorton ! In fact now I notice that I understood the docs the other way: it is in split and others, where it _can_ be used (_Strongly typed regexp constants cannot be used everywhere that a regular regexp constant can, because this would make the language even more confusing. Instead, you may use them only in certain contexts_) – fedorqui Jan 07 '21 at 17:08
  • OK so I went ahead and posted in [How do strongly typed regexp constants work in GNU Awk?](https://stackoverflow.com/q/65617751/1983854). If anyone wants to put their two cents and make this element more known to people... – fedorqui Jan 07 '21 at 18:11

As @anubhava mentions, gawk has split() (and patsplit() which is to FPAT as split() is to FS - see https://www.gnu.org/software/gawk/manual/gawk.html#String-Functions) to do what you want. If you want the same functionality with a POSIX awk then:

$ cat tst.awk
function getFldsSeps(str,flds,fs,seps,  nf) {
    delete flds
    delete seps

    if ( fs == " " ) {
        fs = "[[:space:]]+"
        if ( match(str,"^"fs) ) {
            seps[0] = substr(str,RSTART,RLENGTH)
            str = substr(str,RSTART+RLENGTH)
        }
    }

    while ( match(str,fs) ) {
        flds[++nf] = substr(str,1,RSTART-1)
        seps[nf]   = substr(str,RSTART,RLENGTH)
        str = substr(str,RSTART+RLENGTH)
    }

    if ( str != "" ) {
        flds[++nf] = str
    }

    return nf
}

{
    print
    nf = getFldsSeps($0,flds,FS,seps)
    for (i=0; i<=nf; i++) {
        printf "{%d:[%s]<%s>}%s", i, flds[i], seps[i], (i<nf ? "" : ORS)
    }
}

Note the specific handling above of the case where the field separator is " ", because that value means two things that differ from all other field-separator values:

  1. Fields are actually separated by chains of any white space, and
  2. Leading white space is to be ignored when populating $1 (or flds[1] in this case), and so that white space, if it exists, must be captured in seps[0] for our purposes, since every seps[N] is associated with the flds[N] that precedes it.

For example, running the above on these 3 input files:

$ head file{1..3}
==> file1 <==
hello;how|are you

==> file2 <==
hello how are_you

==> file3 <==
    hello how are_you

we'd get the following output, where each field is displayed as the field number, then the field value within [...], then the separator within <...>, all within {...} (note that seps[0] is populated IFF the FS is " " and the record starts with white space):

$ awk -F'[;|]' -f tst.awk file1
hello;how|are you
{0:[]<>}{1:[hello]<;>}{2:[how]<|>}{3:[are you]<>}

$ awk -f tst.awk file2
hello how are_you
{0:[]<>}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}

$ awk -f tst.awk file3
    hello how are_you
{0:[]<    >}{1:[hello]< >}{2:[how]< >}{3:[are_you]<>}

An alternative option to split is to use match to find the field separators and read them into an array:

awk -F'[;|]' '{
    cnt=0; delete map # Reset the separator array for each record
    str=$0; # Set str to the line
    while (match(str,FS)) { # Loop through each match of the field separator
      map[cnt+=1]=substr(str,RSTART,RLENGTH); # Store each matched field separator
      str=substr(str,RSTART+RLENGTH) # Set str to the rest of the string after the match
    }
    for (i=1;i<=NF;i++) {
      printf "%s%s",$i,map[i] # Print each field along with the separator held in the array map
    }
    printf "\n"
   }' <<< "hello;how|are you"
  • 2
    jfyi this is applying the same regex twice: once for field splitting and a second time for capturing delimiters. If the splitting regex is complex then it will slow things down. – anubhava Jan 04 '21 at 10:43