TL;DR: see the TL;DR note near the end of the question.
I came across this data with pipes as field delimiters (|) and backslash-quote pairs as quotes (\") around fields that contain delimiters in the data, such as:
1|\"2\"|\"3.1|3.2\"|4 # basically 1, 2, 3.1|3.2, 4
that is (in awk):
$1==1
$2==\"2\"
$3==\"3.1|3.2\"
$4==4
I decided to use GNU awk's FPAT to solve the field issue, since writing a negative-match regex for \" didn't seem that bad.
I came across this answer to "Regular expression to match a line that doesn't contain a word", which links to an online (offsite) generator of negative regular expressions for a given input phrase.
As the generator currently supports only alphanumeric and space characters, \" (backslash-quote) was replaced with bq and the generator produced the regex:
^([^b]|b+[^bq])*b*$
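A quick way to check what this pattern accepts is to run it over a few strings with grep -E. This is only a sketch: the four sample strings are made up for illustration, and only the pattern itself comes from the generator.

```shell
# Keep only the lines that do NOT contain the phrase "bq".
# The sample strings are hypothetical test input.
matches=$(printf '%s\n' abc bbx bq abbq | grep -E '^([^b]|b+[^bq])*b*$')
printf '%s\n' "$matches"
# prints:
# abc
# bbx
```

The two strings containing bq are rejected, while a run of b followed by some other character (bbx) is still accepted.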
| was replaced with p, and the data above became:
1pbq2bqpbq3.1p3.2bqp4
1|\"2\"|\"3.1|3.2\"|4 # original for comparison
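The same substitution can be reproduced mechanically. The sketch below uses sed, which is an assumption for illustration; the question does not say how the replacement was actually made.

```shell
# Replace every backslash-quote pair with "bq", then every pipe with "p".
# The sed expressions are illustrative, not taken from the question.
orig='1|\"2\"|\"3.1|3.2\"|4'
subst=$(printf '%s\n' "$orig" | sed -e 's/\\"/bq/g' -e 's/|/p/g')
printf '%s\n' "$subst"
# prints: 1pbq2bqpbq3.1p3.2bqp4
```

The order matters only in that \" has to be handled as a pair; the pipe replacement is independent of it.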
The sample FPAT from the GNU awk documentation (FPAT="([^,]*)|(\"[^\"]+\")") was used as a template for a new FPAT:
FPAT="([^p]*)|(bq([^b]|b+[^bq])*b*bq)"
and a trial was done:
$ gawk 'BEGIN {
FPAT="([^p]*)|(bq([^b]|b+[^bq])*b*bq)"
OFS=ORS
}
{
print $1,$2,$3,$4
}' data
which output:
1
bq2bq
bq3.1p3.2bq
4
which is right. Replacing p with | and q with " in the program gave:
$ gawk 'BEGIN {
FPAT="([^|]*)|(b\"([^b]|b+[^b\"])*b*b\")"
OFS=ORS
}
{
print $1,$2,$3,$4
}' data
which outputs:
1
b"2b"
b"3.1|3.2b"
4
which is still right. However, replacing the bs with \s and adding some escaping resulted in:
(TL;DR: how do I fix the escaping in the script below?)
$ gawk 'BEGIN {
FPAT="([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")"
OFS=ORS
}
{
print $1,$2,$3,$4
}' data
and the output is wrong, differing from the previous runs:
1
\"2\"
\"3.1
3.2\"
so there is probably something wrong with my \\s, but after too much trial and error my head is filled with backslashes and all thoughts have pretty much escaped (pun intended). And as the community is all about sharing, I thought I'd share my headache with you guys.
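One way to see which regex awk actually receives from the quoted string is to print the string literal itself. This is a minimal diagnostic sketch; it uses plain awk because string-escape processing is not specific to gawk, and the variable name seen is just for illustration.

```shell
# Print the FPAT string exactly as awk stores it after string-escape processing.
seen=$(awk 'BEGIN { print "([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")" }')
printf '%s\n' "$seen"
# prints: ([^|]*)|(\"([^\]|\+[^\"])*\*\")
```

Every \\ in the string has collapsed to a single backslash, so the regex compiler sees \" (an escaped quote, which matches a plain ") rather than a backslash followed by a quote.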
Edit: Apparently it has something to do with backslashes in quoted strings, since if instead of defining FPAT="..." I use GNU awk's strongly typed regexp constants, FPAT=@/.../, I get the correct output:
$ gawk 'BEGIN {
FPAT=@/([^|]*)|(\\\"([^\\]|\\+[^\\\"])*\\*\\\")/
OFS=ORS
}
{
print $1,$2,$3,$4
}' data
Output now:
1
\"2\"
\"3.1|3.2\"
4
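If the quoted-string form is needed, the difference from the regex constant suggests that each regex-level backslash has to be doubled once more, because the string literal consumes one level of escaping before the regex is compiled. The following is a sketch under that assumption, feeding the sample line through a pipe instead of the data file:

```shell
# FPAT as a quoted string: a literal backslash in the regex is written as
# four backslashes here (two for the string layer times two for the regex layer).
fields=$(printf '%s\n' '1|\"2\"|\"3.1|3.2\"|4' | gawk 'BEGIN {
FPAT="([^|]*)|(\\\\\"([^\\\\]|\\\\+[^\\\\\"])*\\\\*\\\\\")"
OFS=ORS
}
{
print $1,$2,$3,$4
}')
printf '%s\n' "$fields"
```

After string processing this becomes ([^|]*)|(\\"([^\\]|\\+[^\\"])*\\*\\"), which is the same pattern as the regex constant, so it should yield the same four fields.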