Replace all characters except for alphanumerics from all languages

Question

How can I keep special (accented) letters in a text file using Python?

Input text file:

abcÃ/cdéf@-www

I want to remove the symbols but keep both plain letters and special (accented) letters. By symbols I mean ~!@#$%^*()_+{}<>:"| and so on. When I ran my code, this is what I got:

Output text file:

abc  cd f  www

The symbols have been replaced with spaces, which is what I want, but the special letters have also been replaced with spaces, which I don't want. Is there a way to remove the symbols but keep the special letters?

Expected output text file:

abcÃ cdéf  www

Here is my code:

import re
string = open('abc.txt', encoding='utf-8').read()
new_str = re.sub(r'[^a-zA-Z0-9\n\.]', ' ', string)
open('abc.txt', 'w', encoding='utf-8').write(new_str)

`new_str = ''.join([char for char in string if char not in "~!@#$%^*()_+{}<>:\"|"])` is probably not the best solution, but it works – Xiidref, Jul 04 '19 at 08:49
@xiidref this is a solution, there is also an `isalpha` method that could work here: `"àbc".isalpha() # >> True` — olinox14, Jul 04 '19 at 08:52
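
The `isalpha` idea from the comment above can be turned into a filter without any regex. The following is only a sketch: the function name `keep_alphanumerics` and its `keep` parameter are made up for illustration, and it uses `str.isalnum()` rather than `isalpha()` so that digits survive too, as in the question's original pattern.

```python
def keep_alphanumerics(text: str, keep: str = "\n.") -> str:
    """Replace every character that is neither alphanumeric nor in `keep` with a space.

    `keep` is a hypothetical parameter; its default mirrors the newline and dot
    that the question's original pattern also preserved.
    """
    return "".join(ch if ch.isalnum() or ch in keep else " " for ch in text)


# str.isalnum() is Unicode-aware, so accented letters are kept.
print(keep_alphanumerics("abcÃ/cdéf@-www"))  # abcÃ cdéf  www
```

Unlike a hard-coded list of symbols, this also replaces symbols that were not anticipated, and it treats `_` as a symbol.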

Jānis Š. · Accepted Answer · 2019-07-04T09:48:39.350

1

Replace your second line with:

new_str = re.sub(r'[^\w\s.,;]', ' ', string)
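
A quick check of this pattern against the sample line from the question may help. This is a sketch, and the second pattern is an assumed variation rather than part of the answer: in Python 3, `\w` in a `str` pattern matches Unicode letters and digits, but it also matches `_`, which the question lists among the symbols to remove.

```python
import re

sample = "abcÃ/cdéf@-www"

# \w is Unicode-aware for str patterns, so Ã and é are kept.
cleaned = re.sub(r"[^\w\s.,;]", " ", sample)
print(cleaned)  # abcÃ cdéf  www

# Assumed variation: \w also matches "_", so add it as an explicit alternative
# if underscores should be replaced as well.
strict = re.sub(r"[^\w\s.,;]|_", " ", "snake_case é")
print(strict)  # snake case é
```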

edited Jul 04 '19 at 09:48

answered Jul 04 '19 at 08:51

Jānis Š.


You've also included `@`, which he wants to replace. Your answer is the best one, but you broke it by putting too much into the character class. – Marek R Jul 04 '19 at 09:04
@Marek. Something bad happened with my fingers. Now edited again. – Jānis Š. Jul 04 '19 at 09:11
I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte` – Edison Toh Jul 04 '19 at 09:11
Works [pretty well](https://wandbox.org/permlink/nLoEMdeKuKR6VoJx) except for the Hebrew (adds some extra spaces). – Marek R Jul 04 '19 at 09:23
@JānisŠ. How can I check whether my text file is not UTF-8 encoded? And how do I fix it if it isn't? – Edison Toh Jul 04 '19 at 09:28
@Marek. Yes, it looks like `\w` does not apply correctly to Hebrew characters. However, if you use the regex module (https://pypi.org/project/regex/) instead of Python's built-in re module, the Hebrew characters have no spaces between them. – Jānis Š. Jul 04 '19 at 10:08
@EdisonToh The problem is how you open the file. You can tell Python which encoding the file uses with `open(fname, encoding="latin-1")`; then the strings will be loaded properly and everything should work like a charm. [See this](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html) – Marek R Jul 04 '19 at 10:13
@EdisonToh. The answer here (https://stackoverflow.com/questions/3269293/how-to-write-a-check-in-python-to-see-if-file-is-valid-utf-8) may help you validate the encoding of your file. – Jānis Š. Jul 04 '19 at 10:15
@JānisŠ. Thanks for your answer, that works for me! Are there any other values for encoding="xxx" besides encoding="latin-1"? – Edison Toh Jul 04 '19 at 10:47
There are plenty of encodings. Look here: https://docs.python.org/3/library/codecs.html#standard-encodings – Jānis Š. Jul 04 '19 at 11:28
@JānisŠ. Thanks for your reply! I will look into it, thanks! – Edison Toh Jul 05 '19 at 02:57
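
The error reported in this thread is what you would see if the sample line were stored as Latin-1: `Ã` is then the single byte 0xc3 at position 3, and the `/` that follows is not a valid UTF-8 continuation byte. A minimal sketch of trying UTF-8 first and falling back, assuming the only candidate encodings are UTF-8 and Latin-1 (the helper name `decode_with_fallback` is hypothetical):

```python
def decode_with_fallback(raw: bytes, encodings=("utf-8", "latin-1")):
    """Return (text, encoding) for the first encoding that decodes `raw`.

    The candidate list is an assumption. latin-1 maps every byte to a
    character, so it never fails and only makes sense as the last entry.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the data")


# Simulate the question's file saved as Latin-1 instead of UTF-8.
raw = "abcÃ/cdéf@-www".encode("latin-1")
text, used = decode_with_fallback(raw)
print(used)  # latin-1
print(text)  # abcÃ/cdéf@-www
```

For a real file, the bytes would come from `open(path, "rb").read()`. A successful Latin-1 decode does not prove the file really is Latin-1, so it is worth checking the result by eye.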

score 0 · Answer 2 · answered Jul 04 '19 at 08:52

You can remove only the punctuation characters:

import re
import string
puncts = re.escape(string.punctuation)
print(re.sub(r'[' + puncts + ']', '', your_string))

answered Jul 04 '19 at 08:52

mnestorov


You should avoid using `+` to format strings: https://realpython.com/python-string-formatting/#3-string-interpolation-f-strings-python-36 – olinox14 Jul 04 '19 at 08:57
I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte`. – Edison Toh Jul 04 '19 at 09:18
Check your encoding in your interpreter or in your python script – mnestorov Jul 04 '19 at 09:22

score 0 · Answer 3 · answered Jul 04 '19 at 08:54

You can try this:

import re
string = open('abc.txt', encoding='utf-8').read()
new_str = re.sub('[/~!@#$%^*()_+{}<>:"|-]', ' ', string) # put your characters to replace here
open('abc.txt', 'w', encoding='utf-8').write(new_str)

The output is:

abcÃ cdéf  www

answered Jul 04 '19 at 08:54

Amit Nanaware


I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte`. – Edison Toh Jul 04 '19 at 09:14
