Why does casting an int to string in Python 3 give me output in Chinese?

Question

Let me start by saying that I do not speak Chinese, nor would there be any reason for my default output to be in Chinese. That said, this is both the strangest and most hilarious bug I've ever encountered.

To start with, my code is supposed to count how many times each substring of length four appears, with overlap, in a DNA sequence. The relevant code looks like this:

#file containing data
f = open(infile, 'r')

#open an additional file to write output to
g = open("k-Mer output.txt", 'w')
#empty list
l=[]

#add lines of file to list
for line in f:
    l.append(line.strip())

d = {}
#adds every unique substring of four to my dict
for i in four_mer_maker():
    d[i] = 0

#l[1] is the sequence to be examined, assume it is all 1 line
#checks four letters, then shifts over one and checks those 4
for i in range(len(l[1]) - 3):
    d[l[1][i:i+4]] += 1

#now just write the ordered values to an output file
for i in sorted(d.items()):
    g.write(str(i[1])+ ' ')

My output file is complete gibberish and looks like this:

‴‱‴″‰‱‱‵‱″‱′′‱′‰‱‱″‱′‱

Even stranger, I tried playing with the output a bit. Changing just the write call to

g.write(str(i[1])+ 'hello')

makes my output look like this:

栴汥潬栱汥潬栴汥潬栳汥潬栰汥潬栱汥潬栱

Google Translate says it's Chinese. What the heck is happening??

Every small integer number (below 1mln or so) represents a Unicode character. Some of those characters are Chinese. Why are you surprised and what did you expect to get by calling `str`?? — DYZ, Feb 03 '18 at 04:59
Probably your text editor is wrong. How are you viewing the file content? — user202729, Feb 03 '18 at 05:02
@DYZ The file-writing portion of the code is something I've copied from other projects I've had, and I've never seen it turn my numbers into Chinese. It normally just writes numbers. Also, when I remove the added ' ', it'll write all my numbers, just without spaces. — Ryan Schubert, Feb 03 '18 at 05:02
@user202729 It's writing to a text file, and I'm opening it in Notepad. — Ryan Schubert, Feb 03 '18 at 05:05
That's a known issue with Notepad, where it tries to predict the encoding. It failed this time. I can confirm that your text file contains the correct data. — user202729, Feb 03 '18 at 05:05
See [this](https://stackoverflow.com/q/6769311/5267751). Try using Notepad++ or other editors. — user202729, Feb 03 '18 at 05:11
Read [this](https://stackoverflow.com/q/5202648/5267751) and [this](https://stackoverflow.com/questions/6048085/writing-unicode-text-to-a-text-file). — user202729, Feb 03 '18 at 05:18
@user202729 I figured it out! I just had to set the encoding for the writable file to UTF-16! I'll edit my post shortly thank you! — Ryan Schubert, Feb 03 '18 at 05:23
@DYZ: Are you thinking of `chr`? `str(100) == '100'`; `chr` is the one that converts integers to corresponding Unicode code points. — user2357112, Feb 03 '18 at 05:35
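
To make the comments above concrete, here is a small sketch of what probably happened. It assumes Notepad guessed UTF-16 (little-endian) for the BOM-less file; the sample strings are made up to resemble the output in the question, and the variable names are only illustrative.

```python
# What the script really writes: ASCII digits separated by spaces.
written = "4 1 4 3 0 1 "
raw = written.encode("ascii")

# Assumption: the viewer reads the file as UTF-16 LE, so every PAIR of
# bytes ("4" + " ", "1" + " ", ...) is shown as one 16-bit character.
misread = raw.decode("utf-16-le")
print(misread)

# The 'hello' variant: "4h", "el", "lo" each collapse into one CJK character.
misread_hello = "4hello1hello".encode("ascii").decode("utf-16-le")
print(misread_hello)

# str() keeps the digits; chr() is the one that maps an int to a code point.
as_text = str(100)   # the three characters '1', '0', '0'
as_char = chr(100)   # the single character 'd'
print(as_text, as_char)
```

If the assumption holds, the first two prints show the same kind of garbled characters as in the question, even though the bytes on disk are ordinary digits and letters.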

score 1 · Accepted Answer · answered Feb 03 '18 at 05:27

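Evidently Notepad was guessing the wrong encoding for my output file. The file is plain ASCII with no BOM, and Notepad misread each pair of bytes as a single UTF-16 character. Changing the following line of code from

g = open("k-Mer output.txt", 'w')

to

g = open("k-Mer output.txt", 'w', encoding = 'utf16')

resolves the issue, because Python then writes a BOM at the start of the file that tells Notepad which encoding to use. Thank you to @user202729.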
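
For anyone who wants to check the fix, here is a minimal sketch. The counts are made-up stand-ins for the real k-mer totals, and the file goes into a temporary directory rather than the working directory; both are assumptions for the example only.

```python
import os
import tempfile

# Made-up counts standing in for the real 4-mer totals.
counts = [4, 1, 4, 3, 0, 1]

# Hypothetical location: a temp directory, so nothing real is overwritten.
out_path = os.path.join(tempfile.mkdtemp(), "k-Mer output.txt")

# With encoding="utf-16", Python writes a byte order mark (BOM) first,
# so an editor no longer has to guess the encoding.
with open(out_path, "w", encoding="utf-16") as g:
    for c in counts:
        g.write(str(c) + " ")

# Look at the raw bytes to confirm that the BOM is there.
with open(out_path, "rb") as fh:
    raw = fh.read()
has_bom = raw[:2] in (b"\xff\xfe", b"\xfe\xff")
print("BOM present:", has_bom)

# Reading the file back with the same encoding gives the plain numbers.
with open(out_path, encoding="utf-16") as fh:
    round_trip = fh.read()
print(round_trip)
```

Using `with` also closes the file for you, which the original snippet never does. If you would rather keep the file in UTF-8, `encoding="utf-8-sig"` also writes a BOM and may be worth trying, though I have only tested the UTF-16 version in Notepad.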

Ryan Schubert
