
I do the following:

re.sub(r'[^ \nA-Za-z0-9/]+', '', document)

to remove every character which is not alphanumeric, space, newline, or forward slash.

So basically I want to remove all special characters except for the newline and the forward slash.

However, I do not want to remove the accented letters that various languages have, such as French, German, etc.

But if I run the code above then for example the word

Motörhead

becomes

Motrhead

and I do not want to do this.

So how do I run the code above but without removing the accented letters?

UPDATE:

@MattM below has suggested a solution that works for languages such as English, French, and German, but it does not work for languages such as Polish, where all the accented letters were still removed.

Outcast

2 Answers

8

I'm pretty sure this would do what you need:

x = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿ/]+', '', 'Motörhead')

Also check here for a discussion about JavaScript regex, which has some relevant information despite the differences between the two languages.


EDIT -

To expand on Outcast's new concern - yes you could include non-Latin characters. However it may get too cumbersome. If you look at a list of Unicode chars, I was including ranges of accented Latin chars. So if you wanted to include all Cyrillic characters as well, we would add Ѐ-ӿ to the regex.

import re

yourString = 'Cyrillic Char Ѥ'
yourString = re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+', '', yourString)
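# Write the result as UTF-8 so the Cyrillic character survives;
# the with-block closes the file automatically.
with open("Output.txt", "w", encoding="utf-8") as text_file:
    text_file.write(yourString)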

However with this method you may have to include many ranges, depending on which chars from which languages you want or don't want.
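One way to avoid listing ranges by hand is to lean on Python 3's Unicode-aware `\w`, which matches letters and digits from any script (plus the underscore). The following is only a sketch of that idea, not a drop-in replacement for the regex above: the helper name `strip_special` is hypothetical, and the NFC normalization step is an assumption that the input may contain decomposed accents (a base letter followed by a combining mark), which `\w` would not keep on their own.

```python
import re
import unicodedata


def strip_special(text):
    """Hypothetical helper: keep letters and digits of any script,
    plus spaces, newlines, and forward slashes."""
    # Compose "letter + combining mark" pairs into single code points,
    # because a bare combining mark is not matched by \w.
    text = unicodedata.normalize("NFC", text)
    # Remove underscores and every character that is not a word
    # character, a space, a newline, or a forward slash.
    return re.sub(r"(?:[^\w \n/]|_)+", "", text)


print(strip_special("Motörhead!"))
print(strip_special("Zażółć gęślą jaźń?"))
print(strip_special("Cyrillic Char Ѥ / ok_1"))
```

Note that this keeps digits and letters from every script, so it is broader than the explicit ranges; if you only want specific alphabets, the range approach gives you finer control.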

Matt M
  • Thanks, but does the expression above include all the accented characters of the main European languages? To start with, I am not sure that you have included all the accented letters of French. – Outcast May 24 '19 at 09:01
  • It may work actually, as I tested some examples. The question now is what I should do if I want to extend this to languages (European or not) which do not use the Latin alphabet. – Outcast May 24 '19 at 09:13
  • I don't know much about non-Latin languages, but you can see what I did by looking at a [list of unicode characters](https://en.wikipedia.org/wiki/List_of_Unicode_characters) and seeing that I just included ranges of accented chars. So if I wanted to also include all Cyrillic letters I could make it `re.sub(r'[^ \nA-Za-z0-9À-ÖØ-öø-ÿЀ-ӿ/]+', '', yourString)` and `print(yourString.encode('utf-8'))` to see it works. – Matt M May 24 '19 at 13:00
  • Ok thank you. I may ask a separate question for this. – Outcast May 24 '19 at 14:03
  • Probably a good idea because the best way to go about that may be very different than the solution for your question here. – Matt M May 24 '19 at 14:04
  • Hm, it worked for languages such as English, French, and German (upvote), but it did not work for other languages such as Polish. I will probably post a new question about it. – Outcast May 29 '19 at 16:32
  • Handling non-Latin chars as well is probably better suited to a different question (as there may be another type of solution for it), but did you try the solution in the comments where I added all Cyrillic chars to the regex? That concept should work for you. Check my post, I edited in more information. – Matt M May 29 '19 at 17:58
0

You may be able to tinker with the character encoding as well. I don't know if you use UTF-8. I somehow had my Python file in UTF-8 (I use Windows, and YMMV), but this works for me:

#coding: iso-8859-1
import re

x = "Mötörhead MÖTÖRHEAD"

y = re.sub(r'[^\xe0-\xff]', '', x)
print(y, "only keeps accented lower-case characters from", x)
z = re.sub(r'[^\xc0-\xff]', '', x)
print(z, "keeps all accented characters from", x)

The first line, the `coding` comment, is important. Without it, Python throws an encoding error for me.

You can use Windows Charmap (Windows Western character set) if you wish to tweak the hex values of the characters you specifically want. 0xc0 is the start of the accented capital letters. But Matt M's code is more readable if you only want to target specific characters or vowels. Mine cuts corners, as the range 0xc0-0xff also covers the division sign (0xf7) and the multiplication sign (0xd7), which are not letters.
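If the two math signs matter, the range can be split around them. This is a minimal sketch assuming Python 3, where `\xNN` escapes in a `str` pattern denote Unicode code points, so the result should not depend on how the source file is saved; the sample string and the variable name `accented_only` are made up for illustration.

```python
import re

text = "Mötörhead MÖTÖRHEAD 3×4÷2"

# Keep only the accented Latin-1 letters (U+00C0 to U+00FF),
# skipping the multiplication sign (U+00D7) and the division
# sign (U+00F7) by splitting the range around them.
accented_only = re.sub(r"[^\xc0-\xd6\xd8-\xf6\xf8-\xff]", "", text)
print(accented_only)
```

These are the same three sub-ranges that Matt M's `À-ÖØ-öø-ÿ` spells out with literal characters.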

aschultz