I am comparing two Chinese characters that are both (I think) properly decoded from UTF-8, but they still compare as unequal and I haven't been able to figure out why. One character is read from an input file and the other comes from a decoded EPUB book.

What I've tried:

  • I have decoded the file from UTF-8 and the EPUB book's content also from UTF-8.
  • Read a number of posts about similar problems, but everything I could find boiled down to people not knowing how to decode the string correctly.

The Code

Read in the file where I get the character to compare:

with open(input_file_name, encoding="utf-8") as input_file:
    word = input_file.read().strip()  # assumed; the original snippet omits the body

In this case, the file is a single line with the character: 子

Read in the ebook and then try to find the character:

import ebooklib
from ebooklib import epub

book = epub.read_epub(args.ebook_path)

for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break

From the code above you can see I'm printing the content of each item in the book. Part of that output includes:

<td class="b_cell1" width="90%"><p class="p_index_">zǐ 子</p>

where the character clearly appears.

What I Expected

I expected the two characters to match. However, if I change the code to:

word = '子'

for doc in book.get_items_of_type(ebooklib.ITEM_DOCUMENT):
    content = doc.content.decode('utf-8')
    print(content)
    if word in content:
        print("MATCH FOUND")
        break

it prints MATCH FOUND and finds the character as expected. If I inspect the UTF-8 bytes of the character read from the file and of the hard-coded word shown above:

  • Value of 子 from my file: b'\xef\xbb\xbf\xe5\xad\x90'
  • Value of 子 as word shown in the code snippet above: b'\xe5\xad\x90'
Grant Curell

1 Answer

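The problem was the byte order mark (BOM): those three extra bytes (\xef\xbb\xbf) at the start of the value read from my file. Decoding with "utf-8" keeps the BOM as an invisible U+FEFF character, so word was actually two characters long and never matched the single character in the book.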
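
The effect can be reproduced with the bytes from the question. This is only a minimal sketch; the variable names are illustrative and not part of the original code:

```python
# Bytes read from the file in the question: UTF-8 BOM followed by the character.
raw = b"\xef\xbb\xbf\xe5\xad\x90"

with_bom = raw.decode("utf-8")         # BOM survives as the invisible U+FEFF
without_bom = raw.decode("utf-8-sig")  # BOM is stripped while decoding

print(len(with_bom), len(without_bom))        # 2 1
print(with_bom == "子", without_bom == "子")  # False True
print(with_bom in "zǐ 子")                    # False
```

Because the BOM prints as nothing, both strings look identical on screen, which is why the mismatch is easy to miss.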

From this post.


Simply use the "utf-8-sig" codec:

fp = open("file.txt", "rb")  # binary mode, so read() returns bytes that can be decoded
s = fp.read()
u = s.decode("utf-8-sig")

That gives you a unicode string without the BOM. You can then use

s = u.encode("utf-8")

to get a normal UTF-8 encoded string back in s [a reference to the original post's variable].
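
In Python 3, where open() in text mode already returns decoded text, the same codec can be passed straight to open(). The following is a minimal sketch, assuming the input file starts with a BOM as in the question; the temporary file and the content string are hypothetical stand-ins for the real input file and the EPUB content:

```python
import os
import tempfile

# Hypothetical stand-in for the input file: "子" preceded by a UTF-8 BOM.
path = os.path.join(tempfile.mkdtemp(), "input.txt")
with open(path, "wb") as f:
    f.write(b"\xef\xbb\xbf\xe5\xad\x90\n")

# "utf-8-sig" skips a leading BOM if present and otherwise behaves like "utf-8".
with open(path, encoding="utf-8-sig") as input_file:
    word = input_file.read().strip()

content = '<p class="p_index_">zǐ 子</p>'  # sample line from the book output
print(word in content)  # True
```

Since "utf-8-sig" also reads files that have no BOM, it should be a safe choice when the input may come from editors that add one.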

Grant Curell