So I have this list of words in a text file. I did not produce the file, so I do not know its encoding.

the list : http://s000.tinyupload.com/?file_id=31195244104486221180

Notepad++ tells me that it's ANSI.

When running this script (reader1.py) :

if __name__ == '__main__':
    words = open("test_list.txt").read().splitlines()
    for word in words:
        print word
        with open("test_list-rewrite.txt", "a") as myfile:
            myfile.write(word + '\n')

The word piirilä is displayed as piirilõ in the console, but in the new file it is stored as piirilä.

What I wonder is, if I compute the hash256 of the variable word, will it run it on piirilä or piirilõ?

word = word.decode('cp-1252') raises an exception.

Thanks

PS : Windows 8.1 64 bits, python 2.7 64 bits

Edit

After some more fiddling I found something weird. I made this:

#!/usr/bin/env python
# --*-- encoding: utf-8 --*--

import hashlib

word1 = 'piirilä'
word2 = 'piirilõ'
word3 = 'Whatitis'

print word1
print hashlib.sha256(word1).hexdigest()
print word2
print hashlib.sha256(word2).hexdigest()
print word3
print hashlib.sha256(word3).hexdigest()

which outputs this :

piirilä
278394edd22799ae29bc881dc66e45e45a9a18972c45a35208b6a3d71e209a10
piiril├Á
7e158cf465d3afadd865684f979f46a5282ef93127c150b55273801086fa3c09
Whatitis
d338e8077b6c9d3d2f09e4e2d4a2a5f52152b72e9b6bb5c456a67f63d853e75f

And I added hashlib.sha256(word).hexdigest() to reader1.py

which then outputs this :

billycorgan
d94a3821ad2b6d26aedf4db13b551d9e0eefeaf92d0615946cdc0215ec974692
brescos64
8840d0e40a83d711ce0b44ed66a5d1e4df06fbf6c5c168e98af4775c6e19f52b
matvois
ef5e930806489e8fcc8e0746ce5f8cb4c6715a56d2fd73d42b1c711b5e71474f
kbeans
c207d8366f3dbae64357088dee8eeeb35a047b2e021342c82aa0bd8c15753d74
Whatitis
d338e8077b6c9d3d2f09e4e2d4a2a5f52152b72e9b6bb5c456a67f63d853e75f
cphu
1427ebcff066a5386d0649842fb60b014bebfc5a1589896a62488865e8f06c50
de'mystifierait
83665461f98de4c270e6a4d69a445ea2f9079693824c0544a9add4caee5c7dd2
wendelboe
1423bf5d682dafdc72937d92811b5ff9d856681e94204d565cb0f29b809f5e13
ketanshah
f9977718f33f9068f20c52321ef02be3611e7c7a0aebb59421e74f864c259f53
piirilõ
a238ede50bc349279c62399b275cfa3271f63bc5e7499cc40aaa4ff84198666d
gasoline
4325ed4bef2a2a10c97cbb8235f822602efc0f04a900f0eb537f8e9fee9728aa
BabyBlues
8168fce33124ecec74e647f119de5b3cda795dcc69c4237d8cf27b10aca07b94

So I get 3 different hashes. Which one is the one I want?

sliders_alpha
  • I **strongly** recommend you try this again with Python 3, so we can diagnose and fix the problem in a reasonable way. I cannot recommend this strongly enough; using Python 2 to learn about encodings is a recipe for needless pain. – alexis Dec 06 '17 at 14:25

2 Answers

I had a look at your text file.

The Linux file command told me its charset is "ISO-8859 text, with CRLF line terminators". So maybe that's why your

word = word.decode('cp-1252') 

raised an exception.

Have a look at "Determine the encoding of text in Python", which is about determining the encoding of a text file in Python.

Best, me

pydvlpr

I compute the hash256 of the variable word, will it run it on piirilä or piirilõ?

The hash will not run on either one; it will run on the sequence of bytes in your variable, whose last byte represents an õ in one encoding and an ä in a different encoding. Apparently your console has a different default encoding from that of Notepad++, so you see the same byte displayed in different ways.
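
A minimal Python 3 sketch of that point. The byte string is what a one-byte-per-character file would hold for this word. Using cp1252 for "ANSI" and cp850 for the console are assumptions (the question does not state the console code page; cp850 is picked because it maps the same byte to õ).

```python
import hashlib

# The bytes as they would sit in the file: "piiril" followed by the single byte 0xE4.
raw = b"piiril\xe4"

# The same byte maps to different characters in different code pages.
as_cp1252 = raw.decode("cp1252")  # assumed "ANSI" view: ends in U+00E4 (a with diaeresis)
as_cp850 = raw.decode("cp850")    # assumed console view: ends in U+00F5 (o with tilde)

# The hash only ever sees the bytes, so how they are displayed does not change it.
digest = hashlib.sha256(raw).hexdigest()

print(as_cp1252 == "piiril\u00e4", as_cp850 == "piiril\u00f5")  # True True
print(digest)
```

Both decoded strings come from the same seven bytes, so there is only one hash for the file's version of the word.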

Your test script contains utf-8 encoded text, which is yet another sequence of bytes (two bytes for each of the accented characters, which is why you see two funny symbols in the output; or try printing repr(word1)). If you want to know the hash of the word that's stored in your file, write a program that reads that from the file and computes its hash.
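
A short Python 3 sketch of why the test script gives a different hash than the file, assuming the file stores the word with one byte per character (latin-1 is used here as a stand-in for whatever single-byte encoding the file really has):

```python
import hashlib

word = "piiril\u00e4"  # piirilä, written with an escape so the source file's encoding does not matter

latin1_bytes = word.encode("latin-1")  # b'piiril\xe4'      (7 bytes)
utf8_bytes = word.encode("utf-8")      # b'piiril\xc3\xa4'  (8 bytes)

latin1_digest = hashlib.sha256(latin1_bytes).hexdigest()
utf8_digest = hashlib.sha256(utf8_bytes).hexdigest()

print(len(latin1_bytes), len(utf8_bytes))  # 7 8
print(latin1_digest == utf8_digest)        # False
```

Same word, two byte sequences, two hashes. Which one is "right" depends only on which bytes the other side of your comparison was hashed from.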

The real solution to your question is to switch to Python 3. You'll then be able to run this code:

words = open("test_list.txt", encoding="latin1").read().splitlines()
for word in words:
    print(word)

Then you can try out different encodings until you find out the right one (in your case, "latin1" seems right). On Python 2, you can do the same after this import:
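
Trying encodings by hand can be sketched as a small loop. `try_encodings` is a hypothetical helper, and the temporary file is a made-up two-word stand-in for test_list.txt (Latin-1 bytes, CRLF line endings), so treat this as an illustration rather than a detection method:

```python
import os
import tempfile


def try_encodings(path, candidates):
    """Return {encoding: list of lines, or None if decoding failed}."""
    results = {}
    for encoding in candidates:
        try:
            with open(path, encoding=encoding) as handle:
                results[encoding] = handle.read().splitlines()
        except UnicodeDecodeError:
            results[encoding] = None
    return results


# Stand-in for test_list.txt (assumption: one byte per character, CRLF endings).
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(b"piiril\xe4\r\ngasoline\r\n")
    sample_path = tmp.name

results = try_encodings(sample_path, ["utf-8", "latin-1", "cp850"])
os.remove(sample_path)

# ascii() shows the code points without depending on the console's own encoding.
for encoding, lines in results.items():
    print(encoding, "failed" if lines is None else [ascii(w) for w in lines])
```

Note that single-byte encodings such as latin-1 and cp850 never fail to decode; they just give different characters, so you still have to look at the result and decide which one reads as real words.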

from codecs import open

But you'll then have unicode strings instead of str, and various confusing things will probably happen. Switching to Python 3 makes it unnecessary to deal with all that.
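
In Python 3 the split between text and bytes is explicit, which also settles the hashing question: hashlib refuses a text string, so you have to pick the encoding yourself. A minimal sketch, where latin-1 is an assumption about the file based on the guess above:

```python
import hashlib

word = "piiril\u00e4"  # a text string (str), not bytes

try:
    hashlib.sha256(word)  # hashing text directly is rejected in Python 3
    rejected = False
except TypeError:
    rejected = True

# Encode explicitly to get the bytes that are actually hashed.
digest = hashlib.sha256(word.encode("latin-1")).hexdigest()

print(rejected)  # True
print(digest)
```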

alexis
  • wait, so "toto" encoded in latin1 and utf8 will not produce the same hash256? – sliders_alpha Dec 07 '17 at 12:02
  • No, they will give the same hash. "toto" is ascii (unless you snuck in a funny character). Both utf8 and latin1 are designed so that they are supersets of ascii, so ascii characters are identically encoded. (All the 8-bit ISO-8859 encodings are supersets of ascii; latin1 is one of them). This is why you can generally ignore encodings when dealing with English text. – alexis Dec 07 '17 at 12:13
  • One qualification: If you use a text editor on Windows to save "toto" to a file in utf-8 format, Windows likes to insert a "byte order mark" (actually unnecessary, and against Unicode recommendations) that Python 2 will read as part of the string, and it will affect the hash. But the actual string "toto", e.g. in a Python literal, is just four ascii bytes in utf-8. – alexis Dec 07 '17 at 12:20
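
Both comments above can be checked with a few lines of Python 3. This is only a sketch: the byte order mark is prepended by hand here to imitate what a Windows editor might write.

```python
import codecs
import hashlib

word = "toto"

# Pure ASCII text has identical bytes in latin-1 and utf-8.
latin1_bytes = word.encode("latin-1")
utf8_bytes = word.encode("utf-8")
same_bytes = latin1_bytes == utf8_bytes == b"toto"

# A UTF-8 byte order mark adds three bytes in front, so the hash changes.
with_bom = codecs.BOM_UTF8 + utf8_bytes  # b'\xef\xbb\xbftoto'
plain_digest = hashlib.sha256(utf8_bytes).hexdigest()
bom_digest = hashlib.sha256(with_bom).hexdigest()

# Decoding with "utf-8-sig" drops the mark and gives back the plain word.
stripped = with_bom.decode("utf-8-sig")

print(same_bytes, plain_digest == bom_digest, stripped == word)  # True False True
```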