Should we consider using range [a-z] as a bug?

Question

In my locale (et_EE) [a-z] means:

abcdefghijklmnopqrsšz

So 6 ASCII chars (tuvwxy) and one from the Estonian alphabet (ž) are not included. I see a lot of modules that still use regexes like

/\A[0-9A-Z_a-z]+\z/

To me this seems the wrong way to define the range of ASCII alphanumeric characters, and I think it should be replaced with:

/\A\p{PosixAlnum}+\z/

Is the first one still considered the idiomatic way? Or an accepted solution? Or a bug?

Or does the last one have some caveats?

Using `[a-z]` isn't itself a bug, but could be the cause of one (or several). — Anthony Grist, Aug 12 '12 at 20:19
Perhaps *design flaw* is a better turn of phrase, but yes, that regular expression could lead to bugs in any software that is intended to ever support more than one locale. (Aside: This question goes far beyond perl, so I would suggest removing the `perl` tag.) — kojiro, Aug 12 '12 at 20:21
Should this really be tagged regex? I probably don't think so. — , Aug 13 '12 at 03:13
Perl does not interpret `[a-z]` according to your locale. Just use things like `\pL` and `\w` and `\p{alpha}` as appropriate. — tchrist, Aug 13 '12 at 03:40
How are you determining that `[a-z]` includes `š` but not `tuvwxy`? It shouldn't, character classes are defined in terms of code point order not any locale-specific collation. — bobince, Aug 13 '12 at 08:49
[`perldoc perlre` - How can I match a locale-smart version of /\[a-zA-Z\]/?](http://perldoc.perl.org/perlfaq6.html#How-can-I-match-a-locale-smart-version-of-%2f%5ba-zA-Z%5d%2f%3f) — Zaid, Aug 14 '12 at 08:56
@tchrist, @bobince: I am sure I had cases where `[a-z]` in my locale led to problems, but now that you describe it as not locale-dependent, I can't reproduce the problem. I'm confused, but it seems you are right ;) — w.k, Aug 14 '12 at 11:37
@Zaid That FAQ is a bit off. It isn’t actually locale-specific to use things like `\p{alpha}`. — tchrist, Aug 14 '12 at 12:17
@w.k I'm betting you had problems with awk or grep, not with perl — eis, Jul 14 '17 at 21:16

eis · Answer 1 · 2021-10-04T06:53:49.617

As this question goes beyond Perl, I was interested to find out how it goes in general. Testing this on popular programming languages with native regular expression support, Perl, PHP, Python, Ruby, Java and Javascript, the conclusions are:

[a-z] will match the ASCII-7 a-z range in each and every one of those languages, always, and locale settings do not affect it in any way. Characters like ž and š are never matched.
\w might or might not match characters like ž and š, depending on the programming language and the parameters given when the regular expression is created. This expression shows the greatest variety: in some languages they are never matched, regardless of options; in others they are always matched; and in some it depends.
POSIX [[:alpha:]] and Unicode \p{Alpha} and \p{L}, if they are supported by the regular expression system of the programming language in question and appropriate configuration is used, will match characters like ž and š.

Note that "Appropriate configuration" did not require locale change: change of locale did not have impact on results in any of the tested systems.
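As a minimal sketch of the three findings above, here is how they look in Python 3's standard `re` module (one of the tested languages). The helper name `classify` is made up for illustration; `re.ASCII` is the standard flag that restricts `\w` to ASCII. No locale is set or changed anywhere.

```python
import re

def classify(ch):
    """Report which of three patterns match the single character ch."""
    return {
        "[a-z]": re.fullmatch(r"[a-z]", ch) is not None,
        "\\w": re.fullmatch(r"\w", ch) is not None,
        "\\w with re.ASCII": re.fullmatch(r"\w", ch, flags=re.ASCII) is not None,
    }

for ch in ("v", "š", "ž"):
    print(ch, classify(ch))
```

For `š` and `ž` only the plain `\w` entry should come out true, which is the "depends on parameters" case described above.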

To be on the safe side, I also tested command-line Perl, grep and awk. Command-line Perl behaves identically to the above. However, the versions of grep and awk I had behave differently from the others: for them, locale also matters for [a-z]. The behaviour is version- and implementation-specific, and the latest versions of these tools do not exhibit it.

In that context (grep, awk or similar command-line tools) I'd agree that using the a-z range without defining the locale could be considered a bug, as you can't really know what you will end up with.

If we go to more details per language, the status seems to be:

Java

In Java, \p{Alpha} matches only ASCII letters if the UNICODE_CHARACTER_CLASS flag is not specified, and any Unicode alphabetic character (including ž) if it is. \w will match characters like ž if the Unicode flag is present and not otherwise, and \p{L} will match regardless of the flag. There are no locale-aware regular expressions, and there is no support for POSIX [[:alpha:]].

PHP

In PHP \w, [[:alpha:]] and \p{L} will match characters like ž if unicode switch is present, and not if it is not. \p{Alpha} is not supported. Locale has no impact on regular expressions.

Python

\w will match mentioned characters if unicode flag is present and locale flag is not present. For unicode strings, unicode flag is assumed by default if Python 3 is used, but not with Python 2. Unicode \p{Alpha}, \p{L} or POSIX [[:alpha:]] are not supported in Python.

The modifier to use locale-specific regular expressions apparently only works for character sets with 1 byte per character, making it unusable for unicode.
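The following sketch shows how these limitations surface on a newer Python than the 3.5.3 tested here. It assumes Python 3.6 or later, where an unknown escape such as `\p` is a pattern error and `re.LOCALE` is rejected for `str` patterns. The `[^\W\d_]` idiom is a common workaround for "any Unicode letter" and is not part of the original test code; the variable names are illustrative.

```python
import re

# \p{L} is not part of the standard re syntax; Python 3.6+ rejects the escape.
try:
    re.compile(r"\p{L}")
    p_escape_supported = True
except re.error:
    p_escape_supported = False

# re.LOCALE is only accepted for bytes patterns; a str pattern raises ValueError.
try:
    re.compile(r"\w", re.LOCALE)
    locale_with_str = True
except ValueError:
    locale_with_str = False

# Workaround for "letters only": word characters minus digits and underscore.
LETTERS = re.compile(r"[^\W\d_]+")

print(p_escape_supported, locale_with_str)
print(bool(LETTERS.fullmatch("šžv")), bool(LETTERS.fullmatch("v2")))
```

If real `\p{...}` properties are needed, the third-party `regex` package mentioned in the test script comments is the usual route.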

Perl

\w matches previously mentioned characters in addition to matching [a-z]. Unicode \p{Letter}, \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. Unicode and locale flags for regular expression didn't change the results, and neither did change of locale or use locale;/no locale;.

Behaviour does not change if we run the tests using command-line Perl.

Ruby

[a-z] and \w detect just the characters [a-z], regardless of options. Unicode \p{Letter}, \p{Alpha} and POSIX [[:alpha:]] are supported and work as expected. Locale has no impact.

Javascript

[a-z] and \w always detect just the characters [a-z]. There is support for the /u Unicode switch in ECMAScript 2015, which is mostly supported by major browsers, but in the tested version it does not bring support for [[:alpha:]], \p{Alpha} or \p{L}, nor does it change the behaviour of \w. (Unicode property escapes such as \p{L} were standardized later, in ECMAScript 2018, and require the /u switch.) The Unicode switch does make each Unicode character be treated as one character, which had been a problem before.

Situation is the same for client-side javascript as well as Node.js.

AWK

For AWK, there is a longer description of the situation in the article A.8 Regexp Ranges and Locales: A Long Sad Story. It explains that in the old world of Unix tools, [a-z] was the correct way to detect lowercase letters, and this is how the tools of the time worked. However, 1992 POSIX introduced locales and changed the interpretation of ranges so that the order of characters was defined by collation order, binding it to the locale. This was adopted by the AWK of the time as well (the 3.x series), which resulted in several issues. By the time the 4.x series was developed, POSIX 2008 had declared the order to be undefined, and the maintainer reverted to the original behaviour.
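To illustrate why a collation-based range behaves so differently from a code-point range, here is a small sketch. It assumes the Estonian alphabet order as a stand-in for the et_EE collation table (real collation uses multi-level weights, so this is a simplification), and the function names are made up for illustration.

```python
# Estonian alphabet order, used here as a simplified stand-in for et_EE collation.
ESTONIAN_ORDER = "abcdefghijklmnopqrsšzžtuvwõäöüxy"

def in_range_by_codepoint(ch, lo="a", hi="z"):
    """Range membership by code point, as Perl, Python, Java etc. do it."""
    return ord(lo) <= ord(ch) <= ord(hi)

def in_range_by_collation(ch, lo="a", hi="z", order=ESTONIAN_ORDER):
    """Range membership by position in the (assumed) collation sequence."""
    if ch not in order:
        return False
    return order.index(lo) <= order.index(ch) <= order.index(hi)

by_collation = "".join(c for c in ESTONIAN_ORDER if in_range_by_collation(c))
print(by_collation)                                        # abcdefghijklmnopqrsšz
print([c for c in "xuzvöä" if in_range_by_collation(c)])   # ['z']
print([c for c in "xuzvöä" if in_range_by_codepoint(c)])   # ['x', 'u', 'z', 'v']
```

Because z sorts before t, u, v, w, x and y in Estonian, a collation-based [a-z] stops early, which reproduces both the set quoted in the question and the "only z matches" result reported for the affected grep and awk versions.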

Nowadays the 4.x series of AWK is mostly used. With it, [a-z] matches a-z regardless of locale, and \w and [[:alpha:]] match locale-specific characters. Unicode \p{Alpha} and \p{L} are not supported.

grep

Grep (as well as sed and ed) uses GNU Basic Regular Expressions, which is an old flavor. It doesn't have support for Unicode character classes.

At least GNU grep 2.16 and 2.25 seem to follow 1992 POSIX in that locale also matters for [a-z], as well as for \w and [[:alpha:]]. This means, for example, that [a-z] only matches z in the string xuzvöä if the Estonian locale is used. Older and newer versions of GNU grep do not show this behaviour, though I am not sure exactly which versions changed it.

The test code used is listed below for each language.

Java (1.8.0_131)

import java.util.regex.*;
import java.util.Locale;

public class RegExpTest {
    public static void main(String args[]) {
        verify("v", 118);
        verify("š", 353);
        verify("ž", 382);

        tryWith("v");
        tryWith("š");
        tryWith("ž");
    }
    static void tryWith(String input) {
        matchWith("[a-z]", input);
        matchWith("\\w", input);
        matchWith("\\p{Alpha}", input);
        matchWith("\\p{L}", input);
        matchWith("[[:alpha:]]", input);
    }

    static void matchWith(String pattern, String input) {
        printResult(Pattern.compile(pattern), input);
        printResult(Pattern.compile(pattern, Pattern.UNICODE_CHARACTER_CLASS), input);
    }
    static void printResult(Pattern pattern, String input) {
        System.out.printf("%s\t%03d\t%5s\t%-10s\t%-10s\t%-5s%n",
          input, input.codePointAt(0), Locale.getDefault(),
          specialFlag(pattern.flags()),
          pattern, pattern.matcher(input).matches());
    }
    static String specialFlag(int flags) {
      if ((flags & Pattern.UNICODE_CHARACTER_CLASS) == Pattern.UNICODE_CHARACTER_CLASS) {
          return "UNICODE_FLAG";
      }
      return "";
    }
    static void verify(String str, int code) {
        if (str.codePointAt(0) != code) {
            throw new RuntimeException("your editor is not properly configured for this character: " + str);
        }
    }
}

PHP (7.1.5)

<?php
/*
PHP, even with 7, only has binary strings that can be operated with unicode-aware
functions, if needed. So functions operating them need to be told which charset to use.

When there is encoding assumed and not specified, PHP defaults to ISO-8859-1.
*/


// PHP7 and extension=php_intl.dll enabled in PHP.ini is needed for IntlChar class
function codepoint($char) {
  return IntlChar::ord($char);
}

function verify($inputp, $code) {
  if (codepoint($inputp) != $code) {
    throw new Exception(sprintf('Your editor is not configured correctly for %s (result %s, should be %s)',
      $inputp, codepoint($inputp), $code));
  }
}

$rowindex = 0;
$origlocale = getlocale();

verify('v', 118);
verify('š', 353); // https://en.wikipedia.org/wiki/%C5%A0#Computing_code
verify('ž', 382); // https://en.wikipedia.org/wiki/%C5%BD#Computing_code

function tryWith($input) {
  matchWith('[a-z]', $input);
  matchWith('\\w', $input);
  matchWith('[[:alpha:]]', $input); // POSIX, http://www.regular-expressions.info/posixbrackets.html
  matchWith('\p{L}', $input);
}
function matchWith($pattern, $input) {
  global $origlocale;
  selectLocale($origlocale);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale('C'); # default (root) locale
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale(['et_EE', 'et_EE.UTF-8', 'Estonian_Estonia.1257']);
  printResult("/^$pattern\$/", $input);
  printResult("/^$pattern\$/u", $input);
  selectLocale($origlocale);
}
function selectLocale($locale) {
  if (!is_array($locale)) {
    $locale = [$locale];
  }
  // On Windows, no UTF-8 locale can be set
  // https://stackoverflow.com/a/16120506/365237
  // https://msdn.microsoft.com/en-us/library/x99tb11d.aspx
  // Available Windows locales
  // https://docs.moodle.org/dev/Table_of_locales
  $retval = setlocale(LC_ALL, $locale);
  //printf("setting locale %s, retval was %s\n", join(',', $locale), $retval);
  if ($retval === false || $retval === null) {
    throw new Exception(sprintf('Setting locale %s failed', join(',', $locale)));
  }
}
function getlocale() {
  return setlocale(LC_ALL, 0);
}
function printResult($pattern, $input) {
  global $rowindex;
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n",
        $rowindex, $input, codepoint($input), getlocale(),
        specialFlag($pattern), 
        $pattern, (preg_match($pattern, $input) === 1)?'true':'false');
  $rowindex = $rowindex + 1;
}
function specialFlag($pattern) {
  $arr = explode('/',$pattern);
  $lastelem = array_pop($arr);
  if (strpos($lastelem, 'u') !== false) {
    return 'UNICODE';
  }
  return '';
}

tryWith('v');
tryWith('š');
tryWith('ž');

Python (3.5.3)

# -*- coding: utf-8 -*-

# with python, there are two strings: unicode strings and regular ones.
# when you use unicode strings, regular expressions also take advantage of it,
# so no need to tell that separately. However, if you want to be using specific
# locale, that you need to tell.

# Note that python3 regexps defaults to unicode mode if unicode regexp string is used,
# python2 does not. Also strings are unicode strings in python3 by default.

# summary: [a-z] is always [a-z], \w will match if unicode flag is present and
# locale flag is not present, no unicode \p{Letter} or POSIX :alpha: exists.
# Letters outside ascii-7 never match \w if locale-specific
# regexp is used, as it only supports charsets with one byte per character
# (https://lists.gt.net/python/python/850772).

# Note that in addition to standard https://docs.python.org/3/library/re.html, more
# complete https://pypi.python.org/pypi/regex/ third-party regexp library exists.

import re, locale

def verify(inputp, code):
  if (ord(inputp[0]) != code):
    raise Exception('Your editor is not configured correctly for %s (result %s)' % (inputp, ord(inputp[0])))
  return

rowindex = 0
origlocale = locale.getlocale(locale.LC_ALL)  

verify(u'v', 118)
verify(u'š', 353)
verify(u'ž', 382)

def tryWith(input):
  matchWith(u'[a-z]', input)
  matchWith(u'\\w', input)

def matchWith(pattern, input):
  global origlocale
  locale.setlocale(locale.LC_ALL, origlocale)
  printResult(re.compile(pattern), input)
  printResult(re.compile(pattern, re.UNICODE), input)
  printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)

  matchWith2(pattern, input, 'C') # default (root) locale
  matchWith2(pattern, input, 'et_EE')
  matchWith2(pattern, input, 'et_EE.UTF-8')
  matchWith2(pattern, input, 'Estonian_Estonia.1257') # Windows locale
  locale.setlocale(locale.LC_ALL, origlocale)

def matchWith2(pattern, input, localeParam):
  try:
    locale.setlocale(locale.LC_ALL, localeParam) # switch to the locale under test
    printResult(re.compile(pattern), input)
    printResult(re.compile(pattern, re.UNICODE), input)
    printResult(re.compile(pattern, re.UNICODE | re.LOCALE), input)
  except locale.Error:
    print("Locale %s not supported on this platform" % localeParam)

def printResult(pattern, input):
  global rowindex
  try:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, input, ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  except UnicodeEncodeError:
    print("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s" % \
          (rowindex, '?', ord(input[0]), locale.getlocale(), \
          specialFlag(pattern.flags), \
          pattern.pattern, pattern.match(input) != None))
  rowindex = rowindex + 1      

def specialFlag(flags):
  ret = []
  if ((flags & re.UNICODE) == re.UNICODE):
    ret.append("UNICODE_FLAG")
  if ((flags & re.LOCALE) == re.LOCALE):
    ret.append("LOCALE_FLAG")
  return ','.join(ret)

tryWith(u'v')
tryWith(u'š')
tryWith(u'ž')

Perl (v5.22.3)

# Summary: [a-z] is always [a-z], \w always seems to recognize given test chars and
# unicode \p{Letter}, \p{Alpha} and POSIX :alpha: are supported.
# Unicode and locale flags for regular expression didn't matter in this use case.

use warnings;
use strict;
use utf8;
use v5.14;
use POSIX qw(locale_h);
use Encode;
binmode STDOUT, "utf8";

sub codepoint {
  my $inputp = $_[0];
  return unpack('U*', $inputp);
}
sub verify {
  my($inputp, $code) = @_;
  if (codepoint($inputp) != $code) {
    die sprintf('Your editor is not configured correctly for %s (result %s)', $inputp, codepoint($inputp))
  }
}

sub getlocale {
  return setlocale(LC_ALL);
}
my $rowindex = 0;
my $origlocale = getlocale();

verify('v', 118);
verify('š', 353);
verify('ž', 382);

# printf('orig locale is %s', $origlocale);

sub tryWith {
  my ($input) = @_;
  matchWith('[a-z]', $input);
  matchWith('\w', $input);
  matchWith('[[:alpha:]]', $input);
  matchWith('\p{Alpha}', $input);
  matchWith('\p{L}', $input);
}

sub matchWith {
  my ($pattern, $input) = @_;
  my @locales_to_test = ($origlocale, 'C','C.UTF-8', 'et_EE.UTF-8', 'Estonian_Estonia.UTF-8');
  for my $testlocale (@locales_to_test) {
    use locale;
    # printf("Testlocale %s\n", $testlocale);
    setlocale(LC_ALL, $testlocale);
    printResult($pattern, $input, '');
    printResult($pattern, $input, 'u');
    printResult($pattern, $input, 'l');
    printResult($pattern, $input, 'a');
   };
  no locale;
  setlocale(LC_ALL, $origlocale);
  printResult($pattern, $input, '');
  printResult($pattern, $input, 'u');
  printResult($pattern, $input, 'l');
  printResult($pattern, $input, 'a');
}


sub printResult{
  no warnings 'locale';
              # for this test, as we want to be able to test non-unicode-compliant locales as well
              # remove this for real usage

  my ($pattern, $input, $flags) = @_;
  my $regexp = qr/$pattern/;
  $regexp = qr/$pattern/u if ($flags eq 'u');
  $regexp = qr/$pattern/l if ($flags eq 'l');
  $regexp = qr/$pattern/a if ($flags eq 'a');  # ASCII-restricted matching
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, $input, codepoint($input), getlocale(),
        $flags, $pattern, (($input =~ $regexp) ? 'true':'false'));
  $rowindex = $rowindex + 1;
}

tryWith('v');
tryWith('š');
tryWith('ž');

Ruby (ruby 2.2.6p396 (2016-11-15 revision 56800) [x64-mingw32])

# -*- coding: utf-8 -*-

# Summary: [a-z] and \w are always [a-z], unicode \p{Letter}, \p{Alpha} and POSIX
# :alpha: are supported. Locale does not have impact.

# Ruby doesn't seem to be able to interact very well with locale without 'locale'
# rubygem (https://github.com/mutoh/locale), so that is used.

require 'rubygems'
require 'locale'

def verify(inputp, code)
  if (inputp.unpack('U*')[0] != code)
    raise Exception, sprintf('Your editor is not configured correctly for %s (result %s)', inputp, inputp.unpack('U*')[0])
  end
end

$rowindex = 0
$origlocale = Locale.current
$origcharmap = Encoding.locale_charmap

verify('v', 118)
verify('š', 353)
verify('ž', 382)

# printf('orig locale is %s.%s', $origlocale, $origcharmap)
def tryWith(input)
  matchWith('[a-z]', input)
  matchWith('\w', input)
  matchWith('[[:alpha:]]', input)
  matchWith('\p{Alpha}', input)
  matchWith('\p{L}', input)
end  

def matchWith(pattern, input)
  locales_to_test = [$origlocale, 'C', 'et_EE', 'Estonian_Estonia']
  for testlocale in locales_to_test
    Locale.current = testlocale
    printResult(Regexp.new(pattern), input)
    printResult(Regexp.new(pattern.force_encoding('utf-8'),Regexp::FIXEDENCODING), input)
  end
  Locale.current = $origlocale
end

def printResult(pattern, input)
  printf("%2d: %s\t%03d\t%-20s\t%-25s\t%-10s\t%-5s\n", 
        $rowindex, input, input.unpack('U*')[0], Locale.current,
        specialFlag(pattern),
        pattern, !pattern.match(input).nil?)
  $rowindex = $rowindex + 1
end

def specialFlag(pattern)
  return pattern.encoding
end

tryWith('v')
tryWith('š')
tryWith('ž')

Javascript (node.js) (v6.10.3)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
function regexptest() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        console.log(
            char
            +'\t'
            + char.codePointAt(0)
            +'\t'
            +(match("[a-z]", char))
            +'\t'
            +(match("\\w", char))
            +'\t'
            +(match("[[:alpha:]]", char))
            +'\t'
            +(match("\\p{Alpha}", char))
            +'\t'
            +(match("\\p{L}", char))
            );
    }
}

regexptest();

Javascript (web browsers)

function match(pattern, input) {
    try {
        var re = new RegExp(pattern, "u");
        return input.match(re) !== null;
    } catch(e) {
        return 'unsupported';
    }
}
window.onload = function() {
    var chars = [
        String.fromCodePoint(118),
        String.fromCodePoint(353),
        String.fromCodePoint(382)
    ];
    for (var i = 0; i < chars.length; i++) {
        var char = chars[i];
        var table = document.getElementById('results');
        table.innerHTML += 
            '<tr><td>' + char
            +'</td><td>'
            + char.codePointAt(0)
            +'</td><td>'
            +(match("[a-z]", char))
            +'</td><td>'
            +(match("\\w", char))
            +'</td><td>'
            +(match("[[:alpha:]]", char))
            +'</td><td>'
            +(match("\\p{Alpha}", char))
            +'</td><td>'
            +(match("\\p{L}", char))
            +'</td></tr>';
    }
}

table {
    border-collapse: collapse;
}
table td, table th {
    border: 1px solid black;
}
table tr:first-child th {
    border-top: 0;
}
table tr:last-child td {
    border-bottom: 0;
}
table tr td:first-child,
table tr th:first-child {
    border-left: 0;
}
table tr td:last-child,
table tr th:last-child {
    border-right: 0;
}

<!DOCTYPE html> 
<html>
<head>
    <meta charset="utf-8" /> 
</head>
<body>
    <table id="results">
    <tr>
        <td>char</td>
        <td>codepoint</td>
        <td>[a-z]</td>
        <td>\w</td>
        <td>[[:alpha:]]</td>
        <td>\p{Alpha}</td>
        <td>\p{L}</td>
    </tr>
    </table>
</body>
</html>

AWK (GNU Awk 4.1.3)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

AWK (GNU Awk 3.1.8)

$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[a-z]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[a-z]+",a)}END{print a[0]}'
z
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"\\w+",a)}END{print a[0]}'
xyzvöä
$ echo "xyzvöä" | LC_ALL=C awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzv
$ echo "xyzvöä" | LC_ALL=et_EE.utf8 awk '{match($0,"[[:alpha:]]+",a)}END{print a[0]}'
xyzvöä

grep (GNU grep 2.10, GNU grep 3.4)

$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzvöä

grep (GNU grep 2.16, GNU grep 2.25)

$ echo xuzvöä | LC_ALL=C grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [a-z]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep [[:alpha:]]
xuzvöä
$ echo xuzvöä | LC_ALL=C grep \\w
xuzvöä
$ echo xuzvöä | LC_ALL=et_EE.utf8 grep \\w
xuzvöä

score 8 · Answer 2 · edited Aug 27 '12 at 03:37

Back in the old Perl 3.0 days, everything was ASCII, and Perl reflected that. \w meant the same thing as [0-9A-Z_a-z]. And, we liked it!

However, Perl is no longer bound to ASCII. I stopped using [a-z] a while ago because I got yelled at when programs I wrote didn't work with languages that weren't English. You can imagine my surprise as an American to discover that there are at least several thousand people in this world who don't speak English.

Perl has better ways of handling [0-9A-Z_a-z] anyway. You can use the [[:alnum:]] set or simply use \w, which should do the right thing. If you must only have lowercase characters, you can use [[:lower:]] instead of [a-z] (which assumes an English type of language). (Perl goes to some lengths to make [a-z] mean just the 26 characters a, b, c, ... z even on EBCDIC platforms.)

If you need to specify ASCII only, you can add the /a qualifier. If you mean locale-specific, you should compile the regular expression within the lexical scope of a 'use locale'. (Avoid the /l modifier, as that applies only to the regular expression pattern, and nothing else. For example in 's/[[:lower:]]/\U$&/lg', the pattern is compiled using locale, but the \U is not. This probably should be considered a bug in Perl, but it is the way things currently work. The /l modifier is really only intended for internal bookkeeping, and should not be typed in directly.) Actually, it is better to translate your locale data upon input into the program, and translate it back on output, while using Unicode internally. If your locale is one of the new-fashioned UTF-8 ones, a new feature in 5.16, 'use locale ":not_characters"', is available to allow the other portions of your locale to work seamlessly in Perl.

$word =~ /^[[:alnum:]]+$/   # $word contains only Posix alphanumeric characters.
$word =~ /^[[:alnum:]]+$/a  # $word contains only ASCII alphanumeric characters.
{ use locale;
  $word =~ /^[[:alnum:]]+$/;# $word contains only alphanum characters for your locale
}

Now, is this a bug? If the program doesn't work as intended, it is a bug plain and simple. If you really want the ASCII sequence, [a-z], then the programmer should have used [[:lower:]] with the /a qualifier. If you want all possible lowercase characters including those in other languages, you should simply use [[:lower:]].
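The advice above to decode on input, work in Unicode internally and encode again on output is not specific to Perl. Here is a minimal Python sketch of the same idea, assuming the input arrives as Windows-1257 (Baltic) bytes; the encoding choice and the names `is_word` and `legacy_bytes` are illustrative assumptions, not something taken from the answer.

```python
import re

WORD = re.compile(r"\w+")

def is_word(raw: bytes, encoding: str) -> bool:
    """Decode legacy bytes first, then match against Unicode text."""
    text = raw.decode(encoding)
    return WORD.fullmatch(text) is not None

# Simulate input from a Windows-1257 source.
legacy_bytes = "šžv".encode("cp1257")

print(is_word(legacy_bytes, "cp1257"))

# Work on the decoded text, and encode only when writing the result out.
output_bytes = legacy_bytes.decode("cp1257").upper().encode("utf-8")
print(output_bytes.decode("utf-8"))
```

With this arrangement the regular expression never sees locale-encoded bytes, so no locale-dependent matching mode is needed.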

Actually, that is not “only POSIX alphanumeric characters”. Believe it or not, it is equivalent to `\p{XPosixAlnum}` not to `\p{PosixAlnum}`. Why anyone would use the clunky POSIX syntax instead of character properties is beyond me. Stay away from `/l`. Decode things to Unicode; do not leave in undecoded. — tchrist, Aug 13 '12 at 03:17
Isn't `/[\p{PosixAlnum}]/` the same as `/[[:alnum:]]/l` and `/[\p{XPosixAlnum}]/` the same as `/[[:alnum:]]/`? Why avoid the `/l`? I always thought it's a good way to ensure that the characters being passed are part of your local character set. For example, the Russian letter Ro looks just like the Latin letter `P`, so PNCBank.com could be the website of that large American bank based in Pittsburgh, PA, or it could be starting with a Ro and be some sort of scam. In OP's case, `/[[:lower]]/l` might be what they need. It'll match Estonian, but not Russian characters. — David W., Aug 13 '12 at 13:07
@tchrist Unicode completely confuses me, and, I'm far from alone in this respect. Most of the developers I meet admit that they're winging this Unicode mess. We have some vague concepts, and we know what we've done in the past, the sort of stuff that sort of works and stuff that doesn't. We've all scanned Wikipedia and our respective programming docs for help. Is there a good resource that can explain Unicode to the perplexed? Something that's a guide and not a mere reference manual -- a good _Llama book_ guide to Unicode. — David W., Aug 13 '12 at 13:16
You want to avoid `/l` because it has a code smell: you forgot to decode. Did you read v4 Camel’s new Unicode chapter and related material? — tchrist, Aug 13 '12 at 13:51
I have the first three editions of the Camel book. My second edition is even signed by you and Larry. I now see there's a fourth edition of the book out in February, and it includes an entire chapter on Unicode. Time to break out the credit card and explain to my wife why I'm buying yet another version of the same book. — David W., Aug 13 '12 at 19:12

score 5 · Answer 3 · answered Aug 12 '12 at 20:42

Possible Locale Bugs

The problem you're facing is not with POSIX character classes per se, but with the fact that the classes are dependent on locale. For example, regex(7) says:

Within a bracket expression, the name of a character class enclosed in "[:" and ":]" stands for the list of all characters belonging to that class...These stand for the character classes defined in wctype(3). A locale may provide others.

The emphasis is mine, but the manual page is clearly saying that the character classes are dependent on locale. Further, wctype(3) says:

The behavior of wctype() depends on the LC_CTYPE category of the current locale.

In other words, if your locale incorrectly defines a character class, then it's a bug that should be filed against the specific locale. On the other hand, if the character class simply defines the character set in a way that you are not expecting, then it may not be a bug; it may just be a problem that needs to be coded around.

Character Classes as Shortcuts

Character classes are shortcuts for defining sets. You certainly aren't restricted to the pre-defined sets for your locale, and you are free to use the Unicode character sets defined by perlre(1), or simply create the sets explicitly if that provides greater accuracy.

You already know this, so I'm not trying to be pedantic. I'm just pointing out that if you can't or won't fix the locale (which is the source of the problem here) then you should use an explicit set, as you have done.

A convenience class is only convenient if it works for your use case. If it doesn't, toss it overboard!
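As a sketch of the "explicit set" option, here is what it could look like in Python for Estonian words. The letter list (ASCII letters plus š, ž, õ, ä, ö, ü in both cases) and the names are assumptions made for illustration; both cases are spelled out instead of relying on a case-insensitive flag, so the set contains exactly what is written.

```python
import re

# Explicit set: ASCII letters plus the extra letters of the Estonian alphabet.
ESTONIAN_LETTERS = "A-Za-zŠšŽžÕõÄäÖöÜü"
ESTONIAN_WORD = re.compile("[" + ESTONIAN_LETTERS + "]+")

def is_estonian_word(s):
    """True if s consists only of letters from the explicit set."""
    return ESTONIAN_WORD.fullmatch(s) is not None

for candidate in ("žürii", "Šokolaad", "naïve", "привет"):
    print(candidate, is_estonian_word(candidate))
```

The trade-off is the usual one: the set does exactly what it says, but it has to be maintained by hand for every language the program needs to support.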

You need to add the fact that `[a-z]` is a character range but not a class. `[:alpha:]` is using a class — Toote, Aug 13 '12 at 01:49

score 1 · Answer 4 · answered Aug 12 '21 at 18:49

If it is exactly what you want then using [a-z] is not wrong.

But it's wrong to believe that English words consist only of [a-zA-Z] or German of [a-zäöüßA-ZÄÖÜ] or names follow [A-Z][a-z]*.

If we want words in any language or writing system (tested against 2,300 languages, each with 50 K of the most frequent words), we can use something like this:

#!perl

use strict;
use warnings;
use utf8;

use 5.020;    # regex_sets need 5.18

no warnings "experimental::regex_sets";

use Unicode::Normalize;

my $word_frequencies = {};

while (my $line = <>) {
    chomp $line;
    $line = NFC($line);

    # NOTE: will catch "broken" words at end/begin of line
    #       and abbreviations without '.'
    my @words = $line =~ m/(
        (?[ \p{Word} - \p{Digit} + ['`´’] ])
        (?[ \p{Word} - \p{Digit} + ['`´’=⸗‒—-] ])*
    )/xg;
    
    for my $word (@words) {
        $word_frequencies->{$word}++;
    }
}

# now count the frequencies of graphemes the text uses

my $grapheme_frequencies = {};
for my $word (keys %{$word_frequencies}) {
    my @graphemes = $word =~ m/(\X)/g;
    for my $grapheme (@graphemes) {
        $grapheme_frequencies->{$grapheme} 
            += $word_frequencies->{$word};
    }
}
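For comparison, a rough Python counterpart of the word-counting step might look like the sketch below. It is a simplification and not a port: the standard `re` module has no `\p{Word}` set arithmetic or `\X`, so letters are approximated with `[^\W\d_]`, only a few joining characters are allowed, and combining marks that NFC cannot compose are not handled. The function name `count_words` is made up for illustration.

```python
import re
import unicodedata
from collections import Counter

# Letters only (word characters minus digits and underscore), optionally
# joined by an apostrophe or hyphen; a simplification of the Perl pattern.
WORD = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")

def count_words(lines):
    """Count word frequencies after normalizing each line to NFC."""
    counts = Counter()
    for line in lines:
        line = unicodedata.normalize("NFC", line)
        counts.update(WORD.findall(line))
    return counts

sample = ["s\u030Cokk šokk", "rock'n'roll e-mail"]
print(count_words(sample))
```

The first sample line spells the same word once with a combining caron and once precomposed; after NFC normalization both count as the same word, which is why the normalization step matters.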

For a narrower check we can look into the definition of \p{Word} in the Unicode standard: https://unicode.org/reports/tr18/#word

word
    \p{alpha}
    \p{gc=Mark}
    \p{digit}
    \p{gc=Connector_Punctuation}
    \p{Join_Control}

Based on \p{Word} we can now define a Regex for e.g. words in the Latin script:

# word:
    \p{Latin}    # \p{alpha}
    \p{gc=Mark}
    # \p{digit}  # we don't want numerals in words
    \p{gc=Connector_Punctuation}
    \p{Join_Control}

score 0 · Answer 5 · answered Aug 11 '21 at 10:11

For awk, perhaps forcing octal codes for the alphabet characters would circumvent the inconsistency between awk, POSIX and locales.

something like

   /[\060-\071       # 0-9
     \101-\132       # A-Z
     \141-\172]/     # a-z
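As a quick sanity check of the octal values, here is the same set of ranges written for Python's `re` module, which also accepts three-digit octal escapes inside a character class. This only verifies which characters the escapes denote; it says nothing about how a particular awk version treats them.

```python
import re

# The same three ranges as in the awk pattern, written as octal escapes.
ASCII_ALNUM = re.compile(r"[\060-\071\101-\132\141-\172]+")

# Each pair of octal codes and the characters it is expected to denote.
bounds = [(0o060, 0o071), (0o101, 0o132), (0o141, 0o172)]
print([(chr(lo), chr(hi)) for lo, hi in bounds])  # [('0', '9'), ('A', 'Z'), ('a', 'z')]

print(bool(ASCII_ALNUM.fullmatch("Az09")), bool(ASCII_ALNUM.fullmatch("š")))
```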

If you want to make them into string constants, perhaps double the backslashes, so that the parser/regex engine doesn't get too smart and pre-convert "\101" into A, which would give it an opening to "respect" locale settings that might not be what you wanted.

"\\101"
