0

I use this code: b = re.sub('[^A-Za-z]+', ' ', a). However, I need to take French accents into account: àâéèêëïîôùûç. Can you please help me? :)

Thanks.

Val

2 Answers

0

If you'd like to replace all the letters, taking Unicode into account, do the following:

text = "àâéèêëïîôùûç"
re.sub('\w+', ' ', text, re.UNICODE)

Please note that re.UNICODE is not needed in Python 3, as str patterns use Unicode matching by default.
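For the goal stated in the question (keep the letters, replace everything else with a space), here is a minimal sketch, assuming Python 3 and precomposed accented characters. The helper name `keep_letters` is hypothetical. It relies on `\w` matching Unicode word characters by default, and excludes digits and the underscore so that roughly only letters survive.

```python
import re

def keep_letters(text):
    # [\W\d_] matches non-word characters, digits and the underscore,
    # i.e. (approximately) everything that is not a letter.
    # Each run of such characters is collapsed into a single space.
    return re.sub(r'[\W\d_]+', ' ', text)

cleaned = keep_letters("Ça coûte 3 euros, l'été!")
print(cleaned)
```

Accented letters such as `Ç`, `û` and `é` are kept, while the digit, the punctuation and the apostrophe each become a space.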

Roy2012
  • 1
    FYI: always use `flags=`, `count=` etc explicitly as the positions change between different functions. For ex: `sub(pattern, repl, string, count=0, flags=0)` and `findall(pattern, string, flags=0)` ... so, your code is actually doing `count=re.UNICODE` – Sundeep Jun 25 '20 at 09:21
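To illustrate the comment above, here is a small sketch of what happens when a flag is passed positionally to `re.sub`. It uses `re.IGNORECASE` rather than `re.UNICODE` only because the effect is easier to see; the sample string is made up. The warning filter is there because newer Python versions may emit a DeprecationWarning for this positional usage.

```python
import re
import warnings

text = "aAaAaA"

with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    # The 4th positional argument of re.sub is count, not flags.
    # re.IGNORECASE has the integer value 2, so this means count=2
    # and the match stays case-sensitive.
    positional = re.sub("a", "x", text, re.IGNORECASE)

# Passing the flag by keyword does what was intended.
keyword = re.sub("a", "x", text, flags=re.IGNORECASE)

print(positional)  # xAxAaA
print(keyword)     # xxxxxx
```

In the same way, `re.UNICODE` has the integer value 32, so passing it positionally to `re.sub` silently limits the call to 32 replacements.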
0

Regex for accented characters has been covered before really well over here.

If you're dealing with French accents (not umlauts etc.) then your code could be updated like this:

b = re.sub('[^A-zÀ-ú]+', ' ', a)

That should amend your previous "all upper and lower case letters" to "all upper and lower case letters including accents"

houseofleft
  • 4
    `À-ú` matches much more than French accented characters. And doesn't match lowercase. `A-z` matches more than just letters. Have a look at an [ASCII table](http://www.asciitable.com/) – Toto Jun 01 '20 at 09:16
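A quick sketch of the range problem, assuming Python 3 and precomposed characters. `A-z` spans code points 0x41 to 0x7A, which includes `[`, `]` and `_`; `À-ú` spans U+00C0 to U+00FA, which includes `×` but stops just before `û`. The sample string and the `FRENCH_LETTERS` name are made up for illustration, and the letter list is one possible choice, not an authoritative inventory.

```python
import re

sample = "sûr_[ok]×"

# Range-based class from the answer: keeps _ [ ] and ×, but drops û.
range_based = re.sub('[^A-zÀ-ú]+', ' ', sample)

# Explicit class: ASCII letters plus an assumed list of French letters.
FRENCH_LETTERS = "A-Za-zàâäæçéèêëîïôœùûüÿÀÂÄÆÇÉÈÊËÎÏÔŒÙÛÜŸ"
explicit = re.sub('[^' + FRENCH_LETTERS + ']+', ' ', sample)

print(range_based)  # s r_[ok]×
print(explicit)     # sûr ok
```

Listing the accepted letters explicitly is more verbose, but it makes it obvious which characters are kept.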