
(See the TL;DR near the end of the question.)

I came across this data with pipes (|) as field delimiters and backslash-quote pairs (\") as quotes around fields that contain the delimiter, such as:

1|\"2\"|\"3.1|3.2\"|4  # basically 1, 2, 3.1|3.2, 4

that is (in awk):

$1==1
$2==\"2\"
$3==\"3.1|3.2\"
$4==4

I decided to try GNU awk's FPAT to solve the field issue, since writing a regex that matches everything except \" didn't seem that hard.

I came across this answer to "Regular expression to match a line that doesn't contain a word", which links to an offsite online generator of negative regular expressions for a given input phrase.

As the generator currently supports only alphanumeric and space characters, \" (backslash-quote) was replaced with bq, and the generator produced this regex:

^([^b]|b+[^bq])*b*$ 

| was replaced with p, so the data above became:

1pbq2bqpbq3.1p3.2bqp4
1|\"2\"|\"3.1|3.2\"|4  # original for comparison

The sample FPAT from the GNU awk documentation (FPAT="([^,]*)|(\"[^\"]+\")") was used as a template for this FPAT:

FPAT="([^p]*)|(bq([^b]|b+[^bq])*b*bq)"

and a trial was done:

$ gawk 'BEGIN {
    FPAT="([^p]*)|(bq([^b]|b+[^bq])*b*bq)"
    OFS=ORS
}
{
    print $1,$2,$3,$4
}' data

which output:

1
bq2bq
bq3.1p3.2bq
4

which is right. Replacing each p with | and each q with " in the program gives:

$ gawk 'BEGIN {
    FPAT="([^|]*)|(b\"([^b]|b+[^b\"])*b*b\")"
    OFS=ORS
}
{
    print $1,$2,$3,$4
}' data

which outputs:

1
b"2b"
b"3.1|3.2b"
4

which is still right. However, replacing each b with \ and adding some escaping resulted in:

(TL;DR: how do I fix the escaping in the script below?)

$ gawk 'BEGIN {
    FPAT="([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")"
    OFS=ORS
} 
{
    print $1,$2,$3,$4
}' data

and the output is wrong, differing from the previous runs:

1
\"2\"
\"3.1
3.2\"

so there is probably something wrong with my \\s, but after too much trial and error my head is filled with backslashes and all thoughts have pretty much escaped (pun intended). And as the community is all about sharing, I thought I'd share my headache with you guys.

Edit: Apparently it's got something to do with backslashes in quotes, since if instead of defining FPAT="..." I use one of GNU awk's strongly typed regexp constants, FPAT=@/.../, I get the correct output:

$ gawk 'BEGIN {
    FPAT=@/([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")/
    OFS=ORS
} 
{
    print $1,$2,$3,$4
}' data

Output now:

1
\"2\"
\"3.1|3.2\"
4
James Brown
  • Regarding `Edit: Apparently it's got something to do with backslashes in quotes` - idk if there's any other issue in the way you're escaping things but that is NOT the problem you're having, it's exactly what I said in my answer, that `[^\\\"]` does not mean `not \"`. I tried using the FPAT in your last code segment but got ```awk: tst.awk:2: warning: regexp escape sequence `\"' is not a known regexp operator``` so idk what you meant to post there. – Ed Morton Dec 21 '21 at 18:51
  • Interesting. I've been getting that same warning lately when using `sub(/\"/...)`, but none of the above segments are giving me that. It feels like it started all of a sudden, if that makes any sense. – James Brown Dec 21 '21 at 18:56
  • Again - `[^\\\"]` means `neither the char \ nor the char "` when you need something that means `not the string \"` and such a construct just does not exist in BREs or EREs which is why you have to convert every `\"` to a single char `X` and THEN you can write `[^X]` as in my answer where I use `\n` for `X`. Sure you can get the expected output from the posted sample input using some other regexp but then it'll fail given other input, e.g. input that contains a single ```\``` or single `"` like ```\"foo"bar\here\"``` – Ed Morton Dec 21 '21 at 18:57
  • I understand that. – James Brown Dec 21 '21 at 18:57
  • you SHOULD get that warning from `sub(/\"/...)` since that regexp is either trying to escape a literal character or it's trying to specify a literal ```\``` but forgetting to escape it - in either case the regexp is wrong and it should be `sub(/"/,...)` or `sub(/\\"/,''')` and the tool doesn't understand which you were trying to say so it takes a guess that you wanted the former and warns you it's doing so. – Ed Morton Dec 21 '21 at 18:58
  • Then I'm not sure what that `Edit: Apparently it's got something to do with backslashes in quotes` was for as it's not related to your problem. – Ed Morton Dec 21 '21 at 18:59
  • Yeah, so it seemed. :D My original problem was, that I couldn't get the backslashes working correctly in the `FPAT="..\\.."` where they would work fine in tests, like `awk '{print ++i,$0;while(sub(/([^|]+)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")/,""))print ++i,$0}'` in which `sub()` seemed to consume the columns correctly. After some more test (not in my `.history` apparently anymore) I decided to test the `@/..\\../` and that working I was amazed of the different outcome (difficulty of simulating nondeterministic computation with a deterministic computer). – James Brown Dec 22 '21 at 07:48
  • A different outcome regarding number of escapes is to be expected when storing a regexp in a string (`FPAT="...\\..."`) instead of a regexp (`FPAT=@/...\\.../`) because then it gets evaluated twice where it's used, once when it's converted to a regexp and then when it's used as that regexp. That's why you need twice the escapes for `printf 'a\\b\n' | awk '{sub("\\\\","-")}1'` (or `printf 'a\\b\n' | awk '{x="\\\\"; sub(x,"-")}1'`) vs `printf 'a\\b\n' | awk '{sub(/\\/,"-")}1'`. – Ed Morton Dec 22 '21 at 11:44
  • See https://www.gnu.org/software/gawk/manual/gawk.html#Computed-Regexps for details of how regexps stored as string constants within string delimiters `"..."` (aka dynamic or computed regexps) are handled differently from regexps stored as regexp constants within regexp delimiters `/.../`. – Ed Morton Dec 22 '21 at 11:49
  • Ah yes, of course. Thanks @EdMorton. And I'm sorry the holidays postponed my reply - and happy new year! – James Brown Jan 11 '22 at 12:09
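
The double evaluation described in the comments above can be made visible by printing what the FPAT="..." assignment actually stores. This is only a minimal sketch: the helper name show_fpat_string is a hypothetical one chosen for illustration, and the script uses no gawk-specific features, so any POSIX awk should behave the same way.

```shell
# Print the FPAT string from the question after awk has processed the
# escapes of the string literal. The printed text is what the regexp
# engine later receives as a dynamic (computed) regexp.
show_fpat_string() {   # hypothetical helper name, for illustration only
    awk 'BEGIN {
        s = "([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")"
        print s
    }'
}

show_fpat_string
```

The stored text is `([^|]*)|(\"([^\]|\+[^\"])*\*\")`: every `\\` has collapsed to a single `\` and every `\"` to a bare `"`, so the regexp engine no longer sees "backslash followed by quote" anywhere. With FPAT=@/.../ this first pass over the escapes does not happen, which would explain the different outcome.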

1 Answer


You seem to be trying to use [^\\\"] to mean "not the string \"", but it doesn't mean that; it means "neither the char \ nor the char "". You need a single char to negate in that part of the FPAT regexp, so the approach is to convert every \" in the input to a single char that can't be present in the input (I use \n below as that's usually RS, but you can use any char that can't be in the record), then split the record into fields, and then restore the \"s before using each individual field:

$ cat tst.awk
BEGIN { FPAT="([^|]*)|(\n[^\n]+\n)" }
{
    gsub(/\\"/,"\n")              # Replace each \" with \n in the record
    $0 = $0                       # Re-split the record into fields
    for (i=1; i<=NF; i++) {
        gsub("\n","\\\"",$i)      # Replace each \n with \" in the field
        print "$"i"=" $i
    }
}

$ awk -f tst.awk file
$1=1
$2=\"2\"
$3=\"3.1|3.2\"
$4=4
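
The point about the bracket expression can be checked one character at a time. The following is a small sketch rather than part of the solution: accepts is a hypothetical helper name, and it only uses portable awk so it does not depend on FPAT.

```shell
# Report whether a single character is accepted by the bracket
# expression [^\\"] (written here as an awk regexp constant).
accepts() {   # hypothetical helper name, for illustration only
    printf '%s\n' "$1" |
    awk '{ print (($0 ~ /^[^\\"]$/) ? "match" : "no match") }'
}

accepts 'a'    # an ordinary character
accepts '\'    # a lone backslash
accepts '"'    # a lone double quote
```

Only the first call prints match. A lone backslash and a lone double quote are each rejected on their own, which is why a quoted field containing either of them cannot be matched by a regexp built on that bracket expression.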

If there is no specific char that can't be present in your input then it's easy to manipulate your input such that whatever character you like cannot be present during field splitting (I'm using \n again here but this time it'd work even if your input was multi-line records containing \ns, assuming you set RS appropriately to allow reading of multi-line records):

$ cat tst.awk
BEGIN { FPAT="([^|]*)|(\n[^\n]+\n)" }
{
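    gsub(/@/,"@A")                # Escape the escape char itself first
    gsub(/\n/,"@B")               # Now no \n remains in the record
    gsub(/\\"/,"\n")              # Replace each \" with \n in the record
    $0 = $0                       # Re-split the record into fields
    for (i=1; i<=NF; i++) {
        gsub("\n","\\\"",$i)      # Replace each \n with \" in the field
        gsub("@B","\n",$i)        # Restore each original \n
        gsub("@A","@",$i)         # Restore each original @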
        print "$"i"=" $i
    }
}

$ awk -f tst.awk file
$1=1
$2=\"2\"
$3=\"3.1|3.2\"
$4=4
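
That the @A/@B escaping is reversible can be checked separately from the field splitting. The sketch below covers only the encode and decode steps, not FPAT, so it runs in any POSIX awk; roundtrip is a hypothetical helper name, and RS="" (paragraph mode) is used here only as one way to get a record that contains a newline.

```shell
# Encode a multi-line record so that it contains no newline, then decode
# it again and compare the result with the original record.
roundtrip() {   # hypothetical helper name, for illustration only
    printf '%s\n' "$1" | awk 'BEGIN { RS = "" }
    {
        orig = $0
        gsub(/@/, "@A")            # escape the escape char itself first
        gsub(/\n/, "@B")           # now no newline remains in the record
        no_nl = (index($0, "\n") == 0)
        gsub(/@B/, "\n")           # decode in the reverse order
        gsub(/@A/, "@")
        print ((no_nl && $0 == orig) ? "ok" : "mismatch")
    }'
}

# Input with a literal @, an embedded newline, and a literal "@B".
roundtrip 'x@y
z@B'
```

This prints ok: the literal @B in the input is first turned into @AB, so it is never confused with an encoded newline when the record is decoded.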
Ed Morton
  • First of all, thank you for such a prompt answer. _- - trying to use `[^\\\"]` to mean `not the string \"` but it doesn't mean that_ Well, that's just a part of the generated regex and it seemed to work in the whole (with `bpq`s) but I couldn't get it working with the correct chars. Then again, I didn't test it any further, yet, aside that one data line in the question, so I really don't know the pitfalls ahead of me. – James Brown Dec 21 '21 at 13:16
  • It only SEEMED to work with `bpq`, it actually cannot possibly work. Seeing `[^bq]` in that regexp was a big clue to the problem. I suspect whatever you were using to generate the regexp thought that `bq` was a variable holding a character rather than intended to be a 2-character string, but idk. – Ed Morton Dec 21 '21 at 13:19