How can I keep special alphabet characters in a text file while removing symbols, using Python?

Input text file:

abcÃ/cdéf@-www

I want to remove the symbols but keep the alphabet and special alphabet characters. By symbols I mean ~!@#$%^*()_+{}<>:"| and so on. After running my code, here is what I got:

Output text file:

abc  cd f  www

The symbols have been removed and replaced with spaces, which is what I want, but the special alphabet characters have also been removed and replaced with spaces, which I don't want. Is there any way to remove the symbols but keep the special alphabet characters?

Expected output text file:

abcà cdéf  www

Here is my code:

import re

string = open('abc.txt', encoding='utf-8').read()
new_str = re.sub('[^a-zA-Z0-9\n\.]', ' ', string)
open('abc.txt', 'w', encoding='utf-8').write(new_str)
Tomerikoo
Edison Toh

3 Answers

Replace your second line with:

new_str = re.sub(r'[^\w\s.,;]', ' ', string)
Jānis Š.
  • You've also included `@`, which he wants replaced. Your answer is the best one, but you broke it by putting too much into the character class. – Marek R Jul 04 '19 at 09:04
  • @Marek. Something bad happened with my fingers. Now edited again. – Jānis Š. Jul 04 '19 at 09:11
  • I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte` – Edison Toh Jul 04 '19 at 09:11
  • Works [pretty well](https://wandbox.org/permlink/nLoEMdeKuKR6VoJx) except for the Hebrew (adds some extra spaces). – Marek R Jul 04 '19 at 09:23
  • @JānisŠ. How can I verify whether my text file is non-UTF-8 encoded? And how do I solve it if it is? – Edison Toh Jul 04 '19 at 09:28
  • @Marek. Yes, it looks like `\w` does not apply correctly to Hebrew characters. However, if you use the regex module (https://pypi.org/project/regex/) instead of Python's built-in re module, the Hebrew characters have no spaces between them. – Jānis Š. Jul 04 '19 at 10:08
  • @EdisonToh The problem is how you open this file. You can tell Python which encoding the file uses with `open(fname, encoding="latin-1")`; then the strings will be loaded properly and everything should work like a charm. [See this](http://python-notes.curiousefficiency.org/en/latest/python3/text_file_processing.html) – Marek R Jul 04 '19 at 10:13
  • @EdisonToh. The answer here (https://stackoverflow.com/questions/3269293/how-to-write-a-check-in-python-to-see-if-file-is-valid-utf-8) may help you to validate encoding of your file. – Jānis Š. Jul 04 '19 at 10:15
  • @JānisŠ. Thanks for your answer, that works for me! May I know whether there are any other values for encoding="xxx" besides encoding="latin-1"? – Edison Toh Jul 04 '19 at 10:47
  • There are plenty of encodings. Look here: https://docs.python.org/3/library/codecs.html#standard-encodings – Jānis Š. Jul 04 '19 at 11:28
  • @JānisŠ. Thanks for your reply! I will look into it, thanks! – Edison Toh Jul 05 '19 at 02:57
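
To make the regex point from this thread concrete, here is a minimal sketch. It assumes Python 3, where `\w` in a `str` pattern matches Unicode letters and digits, so accented letters survive. The helper name `strip_symbols` is hypothetical and not part of the answer. Because `\w` also matches `_`, which the question lists as a symbol, the sketch adds the underscore as an explicit alternative. Unlike the answer's pattern, it keeps only the dot (as the question's original regex did), not `,` or `;`.

```python
import re

# [^\w\s.] matches anything that is not a word character, whitespace, or a dot.
# The |_ alternative also catches the underscore, which \w would otherwise keep.
SYMBOLS = re.compile(r'[^\w\s.]|_')


def strip_symbols(text):
    """Replace every symbol with a space, keeping letters (accented ones too)."""
    return SYMBOLS.sub(' ', text)


# Simulate reading a Latin-1 encoded file, as suggested in the comments.
raw = 'abcÃ/cdéf@-www'.encode('latin-1')
text = raw.decode('latin-1')
print(strip_symbols(text))
```

With a real file, the same function would presumably be applied to the result of `open('abc.txt', encoding='latin-1').read()`, provided the file really is Latin-1 encoded.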
You can remove only the special characters/punctuation:

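import re
import string

# Build a character class from all ASCII punctuation, escaped for use in a regex.
puncts = re.escape(string.punctuation)
print(re.sub(r'[' + puncts + ']', '', your_string))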
mnestorov
  • You should avoid using `+` to format strings: https://realpython.com/python-string-formatting/#3-string-interpolation-f-strings-python-36 – olinox14 Jul 04 '19 at 08:57
  • I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte`. – Edison Toh Jul 04 '19 at 09:18
  • Check your encoding in your interpreter or in your python script – mnestorov Jul 04 '19 at 09:22
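
As a sketch of the same idea with the comment's f-string suggestion applied, the version below builds the character class with an f-string and replaces each punctuation character with a space, which is what the question asks for. The function name `replace_punctuation` and its `replacement` parameter are hypothetical. Note that `string.punctuation` contains only ASCII punctuation, so accented letters are left alone, but it does include `.`, which the question's original pattern kept.

```python
import re
import string

# string.punctuation is ASCII-only: !"#$%&'()*+,-./:;<=>?@[\]^_`{|}~
# re.escape makes characters such as ], \, ^ and - safe inside the class.
PUNCT_CLASS = re.compile(f'[{re.escape(string.punctuation)}]')


def replace_punctuation(text, replacement=' '):
    """Replace each ASCII punctuation character with `replacement`."""
    return PUNCT_CLASS.sub(replacement, text)


print(replace_punctuation('abcÃ/cdéf@-www'))
```

Passing `replacement=''` would delete the punctuation instead of replacing it with spaces.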
You can try this:

import re
string = open('abc.txt', encoding='utf-8').read()
new_str = re.sub('[/~!@#$%^*()_+{}<>:"|-]', ' ', string) # put your characters to replace here
open('abc.txt', 'w', encoding='utf-8').write(new_str)

output is:

abcà cdéf  www
Amit Nanaware
  • I tried this code, but I got this error: `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xc3 in position 3: invalid continuation byte`. – Edison Toh Jul 04 '19 at 09:14
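
The error reported in these comments can be reproduced without a file, which may help explain it. The sketch below assumes the input was saved as Latin-1, where `Ã` is the single byte 0xc3. UTF-8 treats 0xc3 as the start of a two-byte sequence, and the following `/` is not a valid continuation byte. The helper `decode_with_fallback` is hypothetical. Falling back to Latin-1 is a guess rather than true detection, since Latin-1 can decode any byte sequence without raising.

```python
def decode_with_fallback(raw, fallback='latin-1'):
    """Try UTF-8 first; if that fails, decode with the fallback encoding.

    Returns the decoded text and the name of the encoding that was used.
    """
    try:
        return raw.decode('utf-8'), 'utf-8'
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the offending byte; exc.reason says why.
        print(f'not UTF-8: byte {raw[exc.start]:#x} at position {exc.start} ({exc.reason})')
        return raw.decode(fallback), fallback


# Bytes as they would be stored in a Latin-1 encoded abc.txt (an assumption).
raw = 'abcÃ/cdéf@-www'.encode('latin-1')
text, used = decode_with_fallback(raw)
print(used, text)
```

For a file on disk, `raw` would come from `open('abc.txt', 'rb').read()`.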