1

I am trying to write a regex validator in python that included accented characters for instance in French, however, I cannot find a valid regex pattern that does this. I have already tried: What's a good regex to include accented characters in a simple way?

But I still cannot validate ådam for instance I basically want all alphanumeric characters plus - , space and apostrophe included but the regex I used is not working:

(?i)^(?:(?![×Þß÷þø])[-'0-9a-zÀ-ÿ ])+$

ashes999
  • 1,234
  • 1
  • 12
  • 36
  • 2
    The example works when I try it on [Regex101](https://regex101.com/r/teTifr/1)? – Timus Oct 02 '20 at 02:39
  • Why don't you use `\w` and `UNICODE` flag? Yes, this character class will be wider than just French accented letters, but you ought to be multicultural these days. – Alexander Mashin Oct 02 '20 at 02:48
  • @Timus for instance it does not support `ßloc` – ashes999 Oct 02 '20 at 02:54
  • 3
    @ashes999 That's due to the negative lookahead `(?![×Þß÷þø])` (or am I missing something?) – Timus Oct 02 '20 at 03:03
  • The answer depends on what kind of text you want your code to handle. The correct way involves normalizing the input, and use `regex` package with full case folding to match the text, since there could be cases where one visible character consists of 2 or more Unicode character - the base character and composing characters – nhahtdh Oct 02 '20 at 09:14

2 Answers2

1

This can be done by importing regex package and using Unicode category \p{L} to match any kind of letter from any language. Apostrophe, spaces, hyphens and digits 0-9 are also matched.

import regex

string = "abcd ßloc ådam - + * 1 2 3 ''×Þß÷þø À-ÿ"
pattern = r"[\p{L}\d-' ]+"
result = regex.findall(pattern, string)

print(result)

# OUTPUT
# ['abcd ßloc ådam - ', ' ', " 1 2 3 ''", 'Þß', 'þø À-ÿ']
Ankit
  • 682
  • 1
  • 6
  • 14
1

You can use the built-in re module with the following expression:

^(?:[^\W_]|[ '-])+$

Details

  • ^ - start of string
  • (?:[^\W_]|[ '-])+ - one or more occurrences of
    • [^\W_] - any letter or digit
    • | - or
    • [ '-] - space, apostrophe
  • $ - end of string.

See the regex demo

Wiktor Stribiżew
  • 607,720
  • 39
  • 448
  • 563