Does GNU awk support POSIX equivalence classes?

Is it possible to match [[=a=]] using awk as it is done in grep?

$ echo ábÅ | grep '[[=a=]]'
ábÅ

$ echo ábÅ | grep -o '[[=a=]]'
á
Å
Eugene Barsky

3 Answers

See the gawk manual's section on bracket expressions, towards the end:

Locale-specific names for a list of characters that are equal. The name is enclosed between ‘[=’ and ‘=]’. For example, the name ‘e’ might be used to represent all of “e,” “ê,” “è,” and “é.” In this case, ‘[[=e=]]’ is a regexp that matches any of ‘e’, ‘ê’, ‘é’, or ‘è’.

These features are very valuable in non-English-speaking locales.

CAUTION: The library functions that gawk uses for regular expression matching currently recognize only POSIX character classes; they do not recognize collating symbols or equivalence classes.

James Brown

Per the GAWK User's Guide: "Caution: The library functions that gawk uses for regular expression matching currently only recognize POSIX character classes; they do not recognize collating symbols or equivalence classes."

Accordingly, you'll have to write out the allowed equivalents in the regex yourself, e.g. /[aáÅ]/ or whatever you're looking for.
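
A minimal sketch of this workaround, assuming UTF-8 input and any POSIX awk. It swaps the bracket list for alternation, which is a substitution rather than what the answer literally shows: if awk runs in a byte-oriented locale, a bracket expression is matched one byte at a time, so a list such as /[aáÅ]/ could also match other characters that share a lead byte. The function name match_a_variants is hypothetical.

```shell
# Hypothetical helper: print input lines containing a, á, or Å.
# Alternation compares whole byte sequences, so it does not rely on
# awk decoding multibyte characters inside a bracket expression.
# Assumes this script and its input are UTF-8 encoded.
match_a_variants() {
    awk '/a|á|Å/'
}

# Only the first line contains one of the listed variants.
printf '%s\n' 'ábÅ' 'xyz' 'été' | match_a_variants
```

The list of variants has to be maintained by hand, since awk will not derive it from the locale.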

There are locale-aware character ranges, but that doesn't seem to be what you're asking about.

Raymond Hettinger

You'll be surprised what gawk is willing to do these days:

 echo 'eÅêéAEè' \
                 \
 | mawk 'BEGIN { FS=RS="^$"
                   ORS=  ""
   } sub(/[\n]$/,"") +\
    gsub("[ \t]+|[\000-\b\v-\37!-\177]|"\
                 "[\200-\277]+","&\n")'  \
                                          \
 | gtee >( gpaste -s -d':' - | ecp >&2; )  \
                                            \
 | LC_ALL=C gawk -b -e '/[=[:lower:]=]|[=Å=]/'
  • e:Å:ê:é:A:E:è

     1   e
     2   Å
     3   ê
     4   é
     5   è
    

Even when I forced both the non-multibyte "C" locale and gawk's byte-mode flag (-b), it was willing to match at the larger class level. However, it would not match the ASCII "A" when I specified only the Scandinavian A-ring.
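
One possible reading of this result, offered as an assumption rather than a confirmed explanation: `[=` introduces an equivalence class only inside an enclosing bracket expression, so `[=Å=]` on its own is an ordinary bracket expression listing `=` and the bytes of Å, and `[=[:lower:]=]` lists `=` plus the lowercase class. In byte mode Å is two bytes, and the sketch below (the helper name utf8_bytes is hypothetical) shows that Å, ê, é, and è all share the lead byte c3, while ASCII A does not. That alone would account for the five matching lines without any equivalence-class support.

```shell
# Hypothetical helper: print the UTF-8 bytes of its argument as hex.
# Assumes this script is saved as UTF-8 with precomposed characters.
utf8_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \n'
}

# Å, ê, é, è each encode as two bytes starting with c3; A is the single byte 41.
for ch in Å ê é è A; do
    printf '%s %s\n' "$ch" "$(utf8_bytes "$ch")"
done
```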

RARE Kpop Manifesto