I need to split a field in awk by a "+" character, but not if the character preceding it is a "?". I have put together a basic example of the idea below.

exampleString="Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3"

awkOutput=`awk '
BEGIN{FS=",";}
{NUM_PARTS=split($2,SUBFIELDS,"+");
    for (i = 1; i<=NUM_PARTS; i++) {
        print SUBFIELDS[i]","
    }
}
' <<<"$exampleString"`

echo $awkOutput;

Current Output:

Var2Pt1, Var2Pt2, Var2Pt?, 3, Var2Pt4,

Desired Output:

Var2Pt1, Var2Pt2, Var2Pt+3, Var2Pt4,

Is there some way to deal with this using split, or is there another means of achieving this in an elegant manner? The need for this has arisen in part of a very large awk script - so the simpler the solution the better!

Thanks.

Just an update - I did not make explicit that the "?" is to be treated as an escape character in my original question - i.e. it is not desired on the output.

paul frith
  • Don't use all upper case for user-defined awk variables to avoid clashes with builtin variable names and so it's clear that your variables aren't built-in ones. – Ed Morton Mar 21 '23 at 14:08
  • In the desired output, shouldn't it be `Var2Pt?+3` instead of `Var2Pt+3`? – user1934428 Mar 21 '23 at 14:12
  • _part of a very large awk script_ .... Which awk are you using? Wouldn't it make more sense to implement this in a language which has _negative lookbehind regexes_, as discussed [here](https://stackoverflow.com/questions/9306202/regex-for-matching-something-if-it-is-not-preceded-by-something-else#9306228), such as Python or Ruby? – user1934428 Mar 21 '23 at 14:21
  • @user1934428 no, the desired output should remove the escape character, which has been set as "?". As for using another scripting language, this is about expedience: I would like to amend something that already exists rather than rewrite it, though yes, another language might be the better ultimate route. – paul frith Mar 21 '23 at 14:44

3 Answers

The simple way is to convert every ?+ to something that can't appear in the current record (e.g. the Record Separator), then do the split() on +, then convert it back:

exampleString="Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3"

awkOutput=`awk '
BEGIN{FS=",";}
{
    gsub(/\?\+/,RS)
    NUM_PARTS=split($2,SUBFIELDS,"+");
    for (i = 1; i<=NUM_PARTS; i++) {
        gsub(RS,"?+",SUBFIELDS[i])
        print SUBFIELDS[i]","
    }
}
' <<<"$exampleString"`

echo $awkOutput;
Var2Pt1, Var2Pt2, Var2Pt?+3, Var2Pt4,

If you really do want ?+ from the input to become + in the output, as shown in the example in your question, then obviously just change gsub(RS,"?+",SUBFIELDS[i]) to gsub(RS,"+",SUBFIELDS[i]).
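Putting that variant together, here is a minimal sketch rather than a drop-in replacement. The function name `split_unescaped` and the lower-case variable names are my own choices for illustration, and unlike the code above it applies the first gsub() to $2 only, trims the blanks left by the ", " separators, and joins the pieces with ", " without a trailing comma:

```shell
# Hypothetical helper: split the 2nd comma-separated field on "+",
# treating "?+" as an escaped literal "+".
split_unescaped() {
    printf '%s\n' "$1" | awk '
    BEGIN { FS = "," }
    {
        # Hide every escaped "?+" behind RS, which cannot occur inside a record.
        gsub(/\?\+/, RS, $2)
        numParts = split($2, subFields, "+")
        for (i = 1; i <= numParts; i++) {
            # Restore the placeholder as a plain "+", dropping the "?" escape.
            gsub(RS, "+", subFields[i])
            # Trim blanks left over from the ", " field separators.
            gsub(/^[ \t]+|[ \t]+$/, "", subFields[i])
            printf "%s%s", (i > 1 ? ", " : ""), subFields[i]
        }
        print ""
    }'
}

split_unescaped "Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3"
# prints: Var2Pt1, Var2Pt2, Var2Pt+3, Var2Pt4
```

This relies on the default RS (a newline), so it assumes the input is read one line per record.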

Ed Morton
  • Thank you - I've marked this as the answer, but I should mention that to get my desired output the second gsub would be `gsub(RS,"+",SUBFIELDS[i])`. – paul frith Mar 21 '23 at 14:48
  • @paulfrith You're welcome. I already said that in the last paragraph of my answer. – Ed Morton Mar 21 '23 at 22:29

Just to note, this type of escaping is typically handled by CSV parsers automatically. awk may not be the best choice for this task.

For example, in Python you could write

>>> import io, csv
>>> exampleString = "Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3"
>>> filelike = io.StringIO(exampleString.split(', ')[1])
>>> list(csv.reader(filelike, delimiter='+', escapechar='?'))
[['Var2Pt1', 'Var2Pt2', 'Var2Pt+3', 'Var2Pt4']]
chepner
  • That's very nice - I definitely will look at porting across when I have more time - the file being imported is very complex, and not actually csv (I just used that for my example) - but I can already see from what you have shown here that this might be a neater route for the future. – paul frith Mar 21 '23 at 16:17

UPDATE 1 :

gawk 5.2.1 happily accepts FS = "?+(+)", owing to its unusual interpretation of a regex made up of nothing but 3 duplication meta-characters and one pair of grouping parentheses. That said, most other ERE dialects don't treat this regex as valid at all, and the official POSIX text discourages such constructs:

 Implementations are permitted to extend the language to allow 
 these. Strictly Conforming applications cannot use such constructs.

In fact, gawk even accepted something ridiculous like this while returning the same expected output :

 FS = "?++++++++++++++(+)+++++++++++++++++++"

========================

__='Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3'

printf '%s' "$__"
Var1, Var2Pt1+Var2Pt2+Var2Pt?+3+Var2Pt4, Var3
printf '%s' "$__" | 

mawk 'gsub( ", |[+]", ",\f", $!(NF=NF)) + \
      gsub(OFS, "+")'     FS='[?][+]' OFS='\61\277\761'
Var1,
     Var2Pt1,
             Var2Pt2,
                     Var2Pt+3,
                              Var2Pt4,
                                      Var3
  • OFS is a UTF-8-invalid byte sequence, which makes it a safe choice for a placeholder or sep.
 Fun trivia regarding the octals that form the `OFS`:
  • For the latter 2 bytes, each has an underlying decimal value that is prime (191 | 241).
  • Read directly as decimal numbers (i.e. \277 -> 277 in base 10), each of the 3 octals is a prime number (61 | 277 | 761).
  • Concatenated, they form a 6th prime number (61277761),
  • but with its digits reversed, that number is simply 8^8.

RARE Kpop Manifesto