My goal is to remove all symbols from a string while preserving the Unicode letters (alphabetical characters from any language). Suppose I have the following string:

carbon copolymers—III❏£\n12- Géotechnique\n

I want to remove the —, ❏ and £ characters between copolymers and \n. I was looking at a related post and thought maybe I should use a regex that removes all symbols, given the correct Unicode character ranges. The characters in my text file range from Latin to Russian and beyond. However, the regex code I've written below doesn't help.

>>> import re
>>> s = u'carbon copolymers—III❏£\n12- Géotechnique\n'
>>> re.sub(ur'[^\u0020-\u00FF\n]+',' ', s)

There seem to be two problems with this method:

1) Different unicode ranges still include some symbols.

2) Sometimes, for some unknown reason the returned result seems to be totally different than what it is supposed to be.

Here's the result of the code above:

carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n
>>> print u'carbon copolymers\xe2\x80\x94III\n12- G\xc3\xa9otechnique\n'
carbon copolymersâIII
12- Géotechnique 
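One possible reading of the output above, in line with the comment about bytes ending up in a Unicode literal: the escapes `\xe2\x80\x94` are exactly the UTF-8 bytes of the em dash, each one shown as a separate code point. The sketch below is Python 3 rather than the Python 2 session above, the variable names are made up for illustration, and it only reproduces the effect; it does not prove where the mix-up happened in the original session.

```python
# Sketch (Python 3): what happens when UTF-8 bytes are read as Latin-1.
em_dash = "\u2014"   # the dash from the sample string
e_acute = "\u00e9"   # the accented e from "Géotechnique"

# Encode to UTF-8, then (wrongly) decode each byte as one Latin-1 character.
utf8_bytes = em_dash.encode("utf-8")
misread = utf8_bytes.decode("latin-1")

# The damage is reversible as long as no byte was dropped along the way.
repaired = misread.encode("latin-1").decode("utf-8")

print(repr(utf8_bytes))
print(repr(misread))
print(repaired == em_dash)
print(repr(e_acute.encode("utf-8").decode("latin-1")))
```

The first character of `misread` is `â`, which matches the stray `â` printed after "copolymers"; the other two characters are invisible control codes.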

Do you know any better way of doing this? Is there a full list of all symbols? Do you have any other ideas rather than regex?

Thank you

Amir
  • Didn't you ask this question before? I think I've seen the same question 3 or 4 times in the last days. – Mariano Dec 02 '15 at 03:13
  • @Mariano No that was a different one. Here it is important to keep all unicode characters except symbols while that was not the case in the previous [question](http://stackoverflow.com/questions/33990023/maintaining-the-consistency-of-strings-before-and-after-converting-to-ascii). – Amir Dec 02 '15 at 03:15
  • It is more or less the same question as http://stackoverflow.com/questions/33990023/maintaining-the-consistency-of-strings-before-and-after-converting-to-ascii. The answer is the same. You're just making it a bit more complicated by writing bytes to an Unicode literal for whatever reason. – roeland Dec 02 '15 at 03:19
  • No it's different. How come the previous question's answer preserves the character **é** the same in the final result? – Amir Dec 02 '15 at 03:20
  • @roeland The reason is something very important that I'm doing for my purposes. I tried working on this problem with many different methods and spent 3 hours on it. Haven't been able to resolve it as of now. It's about processing millions of words in a huge project. So, that's something important for me :) – Amir Dec 02 '15 at 03:22
  • Because the `re.sub` in the other question's answer uses `\w` and `flags=re.U` to avoid selecting anything Unicode considers alphanumeric. It leaves the accented e in place, then the `normalize` trick removes the accent. – Mark Tolonen Dec 02 '15 at 03:25
  • @Amir Then you should start reading a bit. Start with https://docs.python.org/2/howto/unicode.html . If possible you should switch to Python 3. – roeland Dec 02 '15 at 03:25
  • @roeland I would switch to Python 3 immediately if you can help me get my desired result here. Most libraries used in the project do not depend on python 2.7. – Amir Dec 02 '15 at 03:26
  • @Amir Well, again, see the answer on the previous question. It works as expected in Python 3 as well. – roeland Dec 02 '15 at 03:28
  • This question doesn't mention removing the accents, but still, the other question removes the symbols and you can skip the `normalize` trick to remove the accents. – Mark Tolonen Dec 02 '15 at 03:30
  • @Amir I agree with roeland and Mark's comments. If you must exclude a specific list of characters, the alternative is to check Unicode Categories here: http://www.fileformat.info/info/unicode/category/index.htm and exclude whatever you want. – Mariano Dec 02 '15 at 03:32
  • Well that was the first thing that I did. Simply using re.sub(r'[^\w\n] ',r' ', s) does something strange to the string and the result becomes this: ***carbon copolymersâIII 12 Géotechnique*** – Amir Dec 02 '15 at 03:35
  • Can anyone tell me why those unicode characters are changed even though they are not supposed to?! – Amir Dec 02 '15 at 03:36
  • `—` isn't a Unicode symbol, it's punctuation. – Jon Hanna Dec 02 '15 at 03:36
  • @JonHanna It's not a 'dash' character. Besides, I have many many more symbolic characters that are all unicode. – Amir Dec 02 '15 at 03:40
  • I gave you a *big* hint in my comment to your previous question, including a useful link. If you didn't bother to read that, why should I or anyone else bother to help further? – Mark Ransom Dec 02 '15 at 04:46
  • @Amir, it is a dash character. If I copy-paste it out of your question and into a script that tells me what a character is, it's definitely an em dash. (And also, of course all your "symbolic characters" are unicode, all characters are unicode). – Jon Hanna Dec 02 '15 at 09:56
  • @JonHanna No it's not ASCII. The character '—' is different than '-'. The former is in Unicode and the latter is ASCII. Use this [function](http://stackoverflow.com/a/196391/2838606) to check for that. Well Mark, I thought maybe I can get a quick answer here from the experts instead of just banging my head into the monitor for hours (like what I did last night and didn't get anywhere). Thanks indeed – Amir Dec 02 '15 at 13:39
  • I didn't say it was ASCII. I said it was punctuation. (Also, `-` is both ASCII and also Unicode. All ASCII characters are Unicode characters too). You are talking about Unicode symbols and then using a punctuation character rather than a symbol character as an example. Either your example is wrong, or you are asking the wrong thing. – Jon Hanna Dec 02 '15 at 13:43
  • @Amir, you complain about using `re.sub(r'[^\w\n] ',r' ', s)` but it's `re.sub(r'[^\w\n] ',r' ', s, flags=re.U)`. – Mark Tolonen Dec 02 '15 at 13:54
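To make the points from the comments concrete, here is a minimal Python 3 sketch of the two suggestions: looking up Unicode general categories with `unicodedata.category`, and using `\w` in a regex. The helper names `strip_by_category` and `strip_by_regex` are made up for this illustration, and the choice to drop every category starting with `P` (punctuation) or `S` (symbol) is an assumption about what "symbols" should mean here. In Python 3, `\w` is Unicode-aware for `str` patterns by default; in Python 2 the `flags=re.U` argument mentioned above is needed.

```python
import re
import unicodedata

# The sample string from the question, written with escapes to avoid
# any dependence on the source file or terminal encoding.
sample = "carbon copolymers\u2014III\u274f\u00a3\n12- G\u00e9otechnique\n"

# General categories: Pd = dash punctuation, So = other symbol,
# Sc = currency symbol, Ll = lowercase letter.
for ch in ["\u2014", "\u274f", "\u00a3", "\u00e9"]:
    print(repr(ch), unicodedata.category(ch), unicodedata.name(ch))


def strip_by_category(text):
    """Replace every punctuation (P*) or symbol (S*) character with a space."""
    return "".join(
        " " if unicodedata.category(c)[0] in ("P", "S") else c for c in text
    )


def strip_by_regex(text):
    """Replace everything that is neither a word character nor whitespace."""
    return re.sub(r"[^\w\s]", " ", text)


print(repr(strip_by_category(sample)))
print(repr(strip_by_regex(sample)))
```

Both helpers keep digits and the accented letter; they differ on characters such as combining marks, which carry category `M*` and are not matched by `\w`.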

1 Answer

I think I found a good solution (>99% robust, I believe) to the problem:

Well here's our new, horrific string:

s = u'carbon҂ ҉ copolymers—⿴٬ٯ٪III❏£\n12-ः׶ Ǣ ܊ܔ ۩۝۞ء܅۵Géotechnique▣ऀ\n'

And here's the resulting string:

u'carbon    copolymers   \u066f III  \n      \u01e2  \u0714    \u0621  G\xe9otechnique  \n'

All the remaining characters/words are in fact alphabetical characters, in different languages. Done with almost no effort!

Here's the solution:

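import re
# Keep letters and whitespace; turn every other character into a space.
s = ''.join([c if c.isalpha() or c.isspace() else ' ' for c in s])
# Also blank out the remaining ASCII/Latin-1 punctuation and symbol ranges.
s = re.sub(ur'[\u0020-\u0040]+|[\u005B-\u0060]+|[\u007B-\u00BF]+', ' ', s)
s = re.sub(r'[ ]+', ' ', s)

Printing s then gives:

carbon copolymers ٯ III  
Ǣ ܔ ء Géotechnique  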
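For readers on Python 3, the core of this approach can be sketched as below. This is an adaptation, not the exact code above: the function name `keep_letters` is made up, the `ur''` prefix does not exist in Python 3, and the middle range-based substitution is left out, so characters such as `µ` (which `isalpha()` accepts) would survive. Note that digits are removed as well, since `str.isalpha()` is true only for letter categories.

```python
import re


def keep_letters(text):
    """Replace non-letter, non-whitespace characters with spaces, then
    collapse runs of spaces. Newlines are preserved."""
    # str.isalpha() is Unicode-aware, so letters from any script are kept.
    cleaned = "".join(c if c.isalpha() or c.isspace() else " " for c in text)
    # Collapse consecutive spaces only; leave newlines untouched.
    return re.sub(r" +", " ", cleaned)


sample = "carbon copolymers\u2014III\u274f\u00a3\n12- G\u00e9otechnique\n"
print(keep_letters(sample))
```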
Amir