How to filter Unicode characters

Question

I have a file containing a list of Unicode characters which (due to a copy paste fail) also has the hex code every 16 characters e.g.

Ս Վ Տ 0550 Ր Ց Ւ Փ Ք Օ Ֆ ՗ ՘ ՙ ՚ ՛ ՜ ՝ ՞ ՟ 0560 ՠ ա բ գ

with the 0550 and 0560 in the middle. I want to write a program that removes these numbers, but when I try to read the file, it raises an error:

Traceback (most recent call last):
  File "C:\Users\Millicent\Desktop\a.py", line 1, in <module>
    open('characters.txt').read()
  File "C:\Python34\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 392: character maps to <undefined>

My current code is

with open('characters.txt','r') as file:
    chars = file.read().split()

def isdigit(string):
    try:
        int(string, 16)
        return True
    except ValueError:
        return False

chars = list(filter(lambda s: not (len(s) == 4 and isdigit(s)), chars))

with open('characters.txt','w') as file:
    file.write(''.join(chars))

Can someone tell me how to make Python accept the special characters?

You do need to open the file with the right codec (`open('characters.txt', encoding='....')`) and not rely on the default. — Martijn Pieters, May 15 '17 at 15:43
@MartijnPieters Post that as an answer and i'll accept it. Thanks! — caird coinheringaahing, May 15 '17 at 15:44
I've duped that to the canonical question on `open()` on Python 3.x on Windows instead. — Martijn Pieters, May 15 '17 at 15:48
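
The suggestion in the comments could look roughly like the sketch below. It assumes the file is saved as UTF-8, which the question does not confirm; if it was saved in another encoding, pass that name instead. The helper names (`is_hex_code`, `strip_hex_codes`, `clean_file`) are made up for illustration. The sketch drops every whitespace-separated token that is exactly four hex digits and joins the rest with single spaces, which is also an assumption about the desired output.

```python
import string

# ASCII hex digits only, so non-ASCII digit characters are never mistaken for codes.
HEX_DIGITS = set(string.hexdigits)


def is_hex_code(token):
    """Return True for a 4-character token made only of hex digits, e.g. '0550'."""
    return len(token) == 4 and all(ch in HEX_DIGITS for ch in token)


def strip_hex_codes(text):
    """Drop the stray code-point labels and keep every other token."""
    return ' '.join(tok for tok in text.split() if not is_hex_code(tok))


def clean_file(path, encoding='utf-8'):
    """Rewrite the file in place without the hex labels.

    The encoding is passed explicitly so Python does not fall back to the
    platform default (cp1252 in the traceback). 'utf-8' is an assumption.
    """
    with open(path, 'r', encoding=encoding) as f:
        text = f.read()
    with open(path, 'w', encoding=encoding) as f:
        f.write(strip_hex_codes(text))
```

Calling `clean_file('characters.txt')` would then rewrite the file in place. Passing the same `encoding` when writing matters as well, because otherwise the default codec may fail to encode the characters on the way out.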
