Awk gives me the following error:

    awk: illegal primary in regular expression (?<=\>)(.*?)(?=\<) at <=\>)(.*?)(?=\<)
     source line number 10 source file transpile.awk
     context is
        match($0, >>>  /(?<=\>)(.*?)(?=\<)/) <<<

But what is an "illegal primary"?

Mike 'Pomax' Kamermans
303
  • Lookarounds are not supported by `awk`. Hence the syntax error. – revo May 01 '18 at 20:20
  • In POSIX regex, `\>` and `\<` are word boundaries. What did you want to achieve with that pattern of yours? – Wiktor Stribiżew May 01 '18 at 20:29
  • @WiktorStribiżew The input lines are text inside HTML tags. I want to match the text without the tags. – 303 May 01 '18 at 20:57
  • @WiktorStribiżew Thank you for pointing to a question about matching HTML contents, but my question is not a duplicate. I actually asked what an illegal primary is, because I want to learn about awk and how to read its error messages. Please remove the mark. – 303 May 01 '18 at 21:08

2 Answers

A "primary", in awk parlance, is the basic unit of a regex.

A regex consists of an alternation of (1 or more) branches. Each branch consists of a concatenation of (0 or more) primaries.

A primary is either a normal character (e.g. `a`), or an escaped special character (e.g. `\*`), or a character class (`[...]`), or a dot (`.`), or an anchor (`^` or `$`), or a parenthesized subexpression (`(...)`). Most of these can have a quantifier (`?`, `+`, `*`), too.
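
To make that list concrete, here is a small sketch that uses several kinds of primary in a single awk regex. The sample input and the helper name `show_match` are made up for illustration:

```shell
# Hypothetical helper: reports whether each input line matches the regex.
show_match() {
    awk '{
        # ^        anchor
        # [a-z]+   character class with a quantifier
        # \*       escaped special character (a literal star)
        # (x|y)    parenthesized subexpression with two branches
        # .        dot (any single character)
        # $        anchor
        if (match($0, /^[a-z]+\*(x|y).$/)) {
            print "match: " $0
        } else {
            print "no match: " $0
        }
    }'
}

printf 'abc*x!\nabc*z!\n' | show_match
```

The first line should be reported as a match and the second as no match, because `z` fits neither branch of `(x|y)`.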

The problem with your regex is that `(?<=\>)` is parsed starting with `(`, which opens a subgroup. The next item then has to be a primary, but `?` is a quantifier, not a primary, so the parser reports an error.

Awk does not support look-ahead or look-behind.
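
Since the goal mentioned in the comments is to get the text without the surrounding tags, one lookaround-free approach is to match the delimiters too and then trim them off with `RSTART` and `RLENGTH`. A minimal sketch, assuming each input line holds one simple element such as `<b>hello</b>`; the helper name `extract_text` and the sample input are made up, and this is not a substitute for a real HTML parser:

```shell
# Hypothetical helper: prints the text between the first ">" and the next "<".
extract_text() {
    awk '{
        # Match ">", then any run of characters other than "<", then "<".
        if (match($0, />[^<]*</)) {
            # Skip the leading ">" and drop the trailing "<".
            print substr($0, RSTART + 1, RLENGTH - 2)
        }
    }'
}

printf '<b>hello</b>\n' | extract_text   # prints: hello
```

The negated character class `[^<]*` stands in for the non-greedy `.*?`, which awk regexes do not have either.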

melpomene

If you look at the awk source code, you can see that "illegal primary" is the FATAL error raised in the `default` case, when none of the cases for the regex tokens matched. It is a separate message from the "syntax error in regular expression" one.

Here is the relevant code from awk's b.c file (abridged):

    case NCCL:
        np = op2(NCCL, NIL, (Node *) cclenter((char *) rlxstr));
        rtok = relex();
        return (unary(np));
    case '^':
        rtok = relex();
        return (unary(op2(CHAR, NIL, itonp(HAT))));
    case '$':
        rtok = relex();
        return (unary(op2(CHAR, NIL, NIL)));
    case '(':
        rtok = relex();
        if (rtok == ')') {  /* special pleading for () */
            rtok = relex();
            return unary(op2(CCL, NIL, (Node *) tostring("")));
        }
        np = regexp();
        if (rtok == ')') {
            rtok = relex();
            return (unary(np));
        }
        else
            FATAL("syntax error in regular expression %s at %s", lastre, prestr);
    default:
        FATAL("illegal primary in regular expression %s at %s", lastre, prestr);

This code is also available on GitHub here:

https://github.com/onetrueawk/awk/blob/master/b.c

In your specific case, the cause is that lookarounds, i.e. lookahead (`(?=...)`, `(?!...)`) and lookbehind (`(?<=...)`, `(?<!...)`), are not supported by awk.

Hope this helps

Sufiyan Ghori