I am doing a simple string comparison between two Chinese characters, both of which are (I think) properly decoded from UTF-8. The comparison still reports them as unequal, and I haven't been able to figure out why. One character is read from an input file and the other comes from a decoded EPUB book.
What I've tried:
- I have decoded both the file and the EPUB book's content from UTF-8.
- I have read a number of posts about similar problems, but everything I could find boiled down to people not knowing how to decode the string correctly.
The Code
Read in the file where I get the character to compare:
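with open(input_file_name, encoding="utf-8") as input_file:
    # Assumed body (omitted from the original snippet): read the single line into `word`
    word = input_file.read().strip()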
In this case, the file is a single line with the character: 子
Read in the ebook and then try to find the character:
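import ebooklib
from ebooklib import epub
book = epub.read_epub(args.ebook_path)
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    # Each document item holds raw bytes; decode them to str before searching
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break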
From the code above you can see I'm printing the content of each item in the book. Part of that output includes:
<td class="b_cell1" width="90%"><p class="p_index_">zǐ 子</p>
where the character clearly appears.
What I Expected
I expected the two characters to match. However, if I change the code to:
word = '子'
for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break
it prints MATCH FOUND and finds the character as expected. If I inspect the UTF-8 bytes of the character read from the file and of the hard-coded word shown above:
- Value of 子 from my file: b'\xef\xbb\xbf\xe5\xad\x90'
- Value of 子 as word shown in the code snippet above: b'\xe5\xad\x90'
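For reference, the three extra leading bytes b'\xef\xbb\xbf' are the UTF-8 byte order mark (BOM), which decodes to the invisible character U+FEFF. Below is a minimal sketch that reproduces the mismatch, assuming the input file was saved with a BOM. The temporary file and the `content` string are stand-ins for the real file and EPUB content, not the actual data:

```python
import codecs
import os
import tempfile

# Hypothetical stand-in for the input file: a UTF-8 BOM followed by the character.
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(codecs.BOM_UTF8 + "子".encode("utf-8"))
    path = tmp.name

try:
    # "utf-8" keeps the BOM as a leading U+FEFF character in the decoded string.
    with open(path, encoding="utf-8") as f:
        with_bom = f.read().strip()
    # "utf-8-sig" drops a leading BOM if one is present.
    with open(path, encoding="utf-8-sig") as f:
        without_bom = f.read().strip()
finally:
    os.remove(path)

# Stand-in for one decoded EPUB document.
content = '<p class="p_index_">zǐ 子</p>'

print(with_bom.encode("utf-8"))     # same bytes as the value read from my file
print(with_bom in content)          # False: the search string is U+FEFF + 子
print(without_bom in content)       # True
```

Note that `strip()` does not remove U+FEFF, since Python does not treat it as whitespace, so the BOM survives in `with_bom` even after stripping.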