strange output regular expression r'[-.\:alnum:](.*)'

Question

I expect to fetch all alphanumeric characters after "-". For example:

>>> str1 = "12 - mystr"
>>> re.findall(r'[-.\:alnum:](.*)',  str1)
[' mystr']

First, it's strange that white space is considered alphanumeric, while I expected to get ['mystr'].

Second, I cannot understand why anything is matched when there is no "-":

>>> str2 = "qwertyuio"
>>> re.findall(r'[-.\:alnum:](.*)',  str2)
['io']

That's wrong. It should be `-\s*([[:alnum:]]+)`. Also Python's `re` doesn't support POSIX character classes. Try `-\s*(\w+)` instead. — revo, Feb 18 '19 at 22:47
Did [the solution](https://stackoverflow.com/a/54756631/3832970) help? If you still have doubts, please let me know via a comment, or please update the question. — Wiktor Stribiżew, Oct 07 '22 at 12:03

score 2 · Answer 1 · answered Feb 18 '19 at 23:01

First of all, Python re does not support POSIX character classes.

Whitespace is not considered alphanumeric. Your first pattern matches `-` with `[-.\:alnum:]`, and then `(.*)` captures zero or more characters other than a newline into Group 1, which is why the leading space ends up in the result. The `[-.\:alnum:]` pattern matches one character that is either `-`, `.`, `:`, `a`, `l`, `n`, `u` or `m`. Thus, when run against `qwertyuio`, `u` is matched and `io` is captured into Group 1.
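
To make the point above concrete, here is a small sketch that probes the set from the question one character at a time. The variable names (`pattern`, `matched`, `first`) are only illustrative.

```python
import re

# The set from the question: a list of literal characters, not a POSIX class.
pattern = re.compile(r'[-.\:alnum:]')

# Every character spelled out inside the brackets matches on its own.
matched = [ch for ch in "-.:alnum" if pattern.fullmatch(ch)]
print(matched)  # ['-', '.', ':', 'a', 'l', 'n', 'u', 'm']

# None of "qwerty" is in the set, so the first hit in "qwertyuio" is "u".
first = pattern.search("qwertyuio")
print(first.group(), first.start())  # u 6

# A space is not in the set; in the question it is consumed by (.*) instead.
print(pattern.fullmatch(" "))  # None
```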

Alphanumeric chars can be matched with the `[^\W_]` pattern. So, to capture all alphanumeric chars after a `-` that is followed by zero or more whitespace characters, you may use

re.findall(r'-\s*([^\W_]+)', s)

Details

- `-`: a hyphen
- `\s*`: 0+ whitespaces
- `([^\W_]+)`: Capturing group 1: one or more (`+`) chars that are letters or digits.

Python demo:

import re

print(re.findall(r'-\s*([^\W_]+)', '12 - mystr')) # => ['mystr']
print(re.findall(r'-\s*([^\W_]+)', 'qwertyuio'))  # => []
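
As a side note on why `[^\W_]` is used here rather than `\w`: the two differ only on the underscore. The sketch below uses a made-up input string (`s`) to show the difference.

```python
import re

s = "id - my_str42 rest"  # hypothetical input containing an underscore

# \w includes the underscore, so the whole token is captured.
with_w = re.findall(r'-\s*(\w+)', s)
# [^\W_] excludes the underscore, so the capture stops right before it.
alnum_only = re.findall(r'-\s*([^\W_]+)', s)

print(with_w)      # ['my_str42']
print(alnum_only)  # ['my']
```

If underscores should count as part of the word, `\w+` (as suggested in the question comments) is the simpler choice.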

score 1 · Answer 2 · answered Feb 18 '19 at 23:02

Your regex says: "Find any one of the characters -.:alnum, then capture any amount of any characters into the first capture group".

In the first test, it found `-` as the first matching character, then captured ` mystr` (leading space included, since `.` also matches a space) in the first capture group. If the regex contains any groups, findall returns a list of the captured groups rather than the whole matches, so the matched `-` is not included.
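
The group-versus-whole-match behaviour of `findall` can be checked directly. This is a minimal sketch using the pattern from the question, with and without the parentheses; the variable names are illustrative.

```python
import re

s = "12 - mystr"

# With a capturing group, findall returns only what the group captured.
groups = re.findall(r'[-.\:alnum:](.*)', s)
# Without a group, findall returns the whole match, hyphen included.
whole = re.findall(r'[-.\:alnum:].*', s)
# A match object exposes both views at once.
m = re.search(r'[-.\:alnum:](.*)', s)

print(groups)            # [' mystr']
print(whole)             # ['- mystr']
print(repr(m.group(0)))  # '- mystr'
print(repr(m.group(1)))  # ' mystr'
```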

Your second test found u as one of the -.:alnum characters (as none of qwerty matched any), then captured and returned the rest after it, io.

As @revo notes in comments, [....] is a character class - matching any one character in it. In order to include a POSIX character class (like [:alnum:]) inside it, you need two sets of brackets. Also, there is no order in a character class; the fact that you included - inside it just means it would be one of the matched characters, not that alphanumeric characters would be matched without it. Finally, if you want to match any number of alphanumerics, you have your quantifier * on the wrong thing.

Thus, "match -, then any number of alphanumeric characters" would be -([[:alnum:]]*), except... Python does not support POSIX character classes. So you have to write your own: -([A-Za-z0-9]*).

However, that will not match your string because the intervening space is, as you note, not an alphanumeric character. To account for that, allow optional whitespace after the hyphen: -\s*([A-Za-z0-9]*).
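
A quick check of that final pattern against the two strings from the question, plus one hypothetical edge case (`"12 - "`) showing what the `*` quantifier allows:

```python
import re

pattern = re.compile(r'-\s*([A-Za-z0-9]*)')

first = pattern.findall("12 - mystr")
second = pattern.findall("qwertyuio")
print(first)   # ['mystr']
print(second)  # []

# Because of *, a hyphen followed by no alphanumerics still matches,
# and the group captures an empty string.
empty = pattern.findall("12 - ")
print(empty)   # ['']

# With + instead of *, at least one alphanumeric is required after the hyphen.
strict = re.findall(r'-\s*([A-Za-z0-9]+)', "12 - ")
print(strict)  # []
```

Whether `*` or `+` is the better fit depends on whether an empty result after a bare hyphen is acceptable.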

answered Feb 18 '19 at 23:02

Amadan

Note it is not always true that "you need two sets of brackets." ICU regex library allows the use of "bare" POSIX character classes, and `[:digit:]+` matches one or more digits. – Wiktor Stribiżew Feb 18 '19 at 23:09
@WiktorStribiżew "In order to include a POSIX character class inside it". In your example, the POSIX character class stands independent, not embedded in another character class expression. – Amadan Feb 18 '19 at 23:13
If you refer to my comment, yes, it is "independent" and it works as is in ICU regex. Try yourself in R `stringr` functions like `str_extract` or in Swift. – Wiktor Stribiżew Feb 18 '19 at 23:15
@WiktorStribiżew I meant, your comment does not apply to my text. I specifically said _embedded_ POSIX classes need an extra pair of brackets (i.e. `[-[:alnum:]]` is "a hyphen or any alphanumeric", `[-:alnum:]` is just "one of `-:alnum`") . As you say, independent POSIX classes are fine with a single pair. This is valid outside ICU; Onigmo does the same thing. – Amadan Feb 18 '19 at 23:16
Sorry, I can't find this in your answer, hence decided to mention that. Also, just in case, in POSIX terminology, those "outer" brackets around the POSIX character class are referred to as bracket expressions, not character classes. Also, I am [not sure about Ruby support](https://ideone.com/LPjjp7) for bare POSIX character classes. – Wiktor Stribiżew Feb 18 '19 at 23:21

score 0 · Answer 3 · answered Feb 19 '19 at 00:20

Not quite sure what you want to match. I'll assume you don't want to include '-' in any matches.

If you want to get all alphanumeric chars after the first '-' and skip all other characters you can do something like this.

# Python's re only allows fixed-width lookbehinds, so whitespace is consumed with \s* instead of (?<=\s+)
re.match(r'.*?-\s*((?:[a-zA-Z\d]+\s*)+)', inputString)

If you want to find each string of alphanumerics after each '-' then you can do this.

re.findall(r'(?<=-)[a-zA-Z\d]+', inputString)
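
One thing worth knowing about the lookbehind form: it only finds alphanumerics that sit directly after a hyphen, with no whitespace in between. The sketch below uses a made-up input (`s`) to compare it with a variant that skips optional whitespace.

```python
import re

s = "a-b1 c - d2 e-f3"  # hypothetical input mixing "-x" and "- x"

# Lookbehind: the alphanumerics must immediately follow the hyphen.
adjacent = re.findall(r'(?<=-)[a-zA-Z\d]+', s)
print(adjacent)  # ['b1', 'f3']

# Consuming the hyphen and optional whitespace also picks up "- d2".
spaced = re.findall(r'-\s*([a-zA-Z\d]+)', s)
print(spaced)    # ['b1', 'd2', 'f3']
```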
