I feel like I am missing something trivial here:

I recently made the jump over to Python3 (Using PyDev in Eclipse).
I have a project that calculates entropy values, and contains the following bit of code:

data = b'NVGI\x19\x01\x10\x00'
seen = dict(((chr(x), 0) for x in range(0,256)))
for byte in data:
    seen[byte]+=1

The binary string in data is much longer, but this is sufficient for demonstration purposes.

With Python 3 this code results in a KeyError. With Python 2.7 it works with no issues at all.

This is due to the fact that iterating over data returns an integer (78 in this case) while seen expects a character 'N' instead.

The curious thing is that in python2.7 the same code will produce the expected character 'N'.

I have for now band-aided this by doing:

seen[ord(byte)] += 1

Can someone please try and replicate this or tell me where I went wrong?

Araiguma
  • Why don't you use a `counter`? – Willem Van Onsem Apr 08 '17 at 17:22
  • You want the same code to run in both Python 2 and 3? – kennytm Apr 08 '17 at 17:25
  • http://stackoverflow.com/questions/28249597/why-do-i-get-an-int-when-i-index-bytes, http://stackoverflow.com/questions/14267452/iterate-over-individual-bytes-in-python3 – Ilja Everilä Apr 08 '17 at 17:26
  • Even with a counter, I would be forced to do `seen[data[counter]]` where `data[counter]` still returns an integer instead of a character. – Araiguma Apr 08 '17 at 17:26
  • @Ilja: Thanks, that's actually insightful. Still extremely weird to have such undocumented differences between python 2 and 3. – Araiguma Apr 08 '17 at 17:28
  • @kennytm No, just python3. I also tested it side by side on my old python2 setup because the result was unexpected and found the difference. – Araiguma Apr 08 '17 at 17:29
  • Hardly undocumented: https://docs.python.org/3.0/whatsnew/3.0.html#text-vs-data-instead-of-unicode-vs-8-bit – Yann Vernier Apr 08 '17 at 17:31
  • I should correct: Not documented where I looked, since I assumed this to be an issue with dictionary construction or list iteration, not the underlying bytes data type. – Araiguma Apr 08 '17 at 17:34
  • The mismatching expectation is that `chr(x)` would produce a byte. It produces a `str`. I'm a little surprised that `ord` accepts `bytes` objects, actually. In your particular case, though, it would appear either an array or a collections.Counter is a better choice than a dictionary, and letting the bytes be ints. – Yann Vernier Apr 08 '17 at 17:46
  • Exactly. For now I fixed my code by removing the `chr(x)`. I may or may not upgrade to the counter when the deadline is less pressing. Thank you again for your valuable input. – Araiguma Apr 08 '17 at 20:29

1 Answer

Because in Python 3 the elements of a bytes object are ints.

Indeed:

>>> type(data[0])
<class 'int'>

This was also specified in the "What's New In Python 3.0" documentation.
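
As a small illustration of that behavior (a sketch, assuming Python 3; the variable names are only for demonstration), indexing and iterating both yield ints, while slicing keeps the bytes type:

```python
data = b'NVGI\x19\x01\x10\x00'

# Indexing or iterating a bytes object yields ints in Python 3.
first = data[0]
items = list(data)

# Slicing keeps the bytes type, so a one-byte slice is still bytes.
first_slice = data[0:1]

print(first)        # 78
print(first_slice)  # b'N'
print(chr(first))   # N
```

So `data[0]` and `data[0:1]` are not interchangeable as dictionary keys: one is the int 78, the other the bytes object b'N'.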

So you can solve the issue by, for instance, constructing the dictionary with integer keys:

seen = dict(((x, 0) for x in range(0,256)))
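
Put together with the loop from the question, the integer-keyed version might look like the sketch below. It uses `dict.fromkeys` as one possible shorthand for building the same dictionary; that substitution is a stylistic choice here, not something the original code requires.

```python
data = b'NVGI\x19\x01\x10\x00'

# Keys are the ints 0..255, matching what iterating over bytes yields.
seen = dict.fromkeys(range(256), 0)
for byte in data:
    seen[byte] += 1

print(seen[78])   # 1  (the byte for 'N')
print(seen[255])  # 0  (never seen)
```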

Or you can do it the opposite way:

data = b'NVGI\x19\x01\x10\x00'
seen = dict(((chr(x), 0) for x in range(0,256)))
for byte in data:
    seen[chr(byte)]+=1

But a more elegant solution is to simply use a Counter:

from collections import Counter

result = Counter(data)

Which generates:

>>> Counter(data)
Counter({16: 1, 1: 1, 86: 1, 25: 1, 73: 1, 71: 1, 78: 1, 0: 1})

A Counter is a subclass of dict, so all dictionary operations are supported on the counter.
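
For instance, the usual dictionary operations work unchanged (a short sketch; the variable names are illustrative only):

```python
from collections import Counter

data = b'NVGI\x19\x01\x10\x00'
result = Counter(data)

is_dict = isinstance(result, dict)  # Counter subclasses dict
total = sum(result.values())        # total number of bytes counted
keys = sorted(result)               # iterating yields the keys, as with a dict
result[78] += 1                     # ordinary item assignment also works
```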

In case you want the counter to contain string values, you can do it like:

result = Counter(chr(x) for x in data)

This gives:

>>> Counter(chr(x) for x in data)
Counter({'\x00': 1, 'G': 1, 'I': 1, '\x01': 1, 'V': 1, 'N': 1, '\x10': 1, '\x19': 1})

Note that querying a Counter for a missing key returns 0 instead of raising a KeyError. This also saves some memory, since the Counter only stores the byte values that actually occur rather than all 256.
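
Since the question mentions entropy, here is a rough sketch of how the counts could feed such a calculation. The function name `shannon_entropy` and the choice of Shannon entropy in bits per byte are assumptions; the question does not say which formula the project uses.

```python
import math
from collections import Counter

def shannon_entropy(data):
    """Return the Shannon entropy of a bytes object, in bits per byte."""
    if not data:
        return 0.0
    counts = Counter(data)
    total = len(data)
    # Only byte values that occur are stored, so no zero counts reach log2.
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

data = b'NVGI\x19\x01\x10\x00'
counts = Counter(data)

print(counts[255])            # 0, and the lookup does not add the key
print(shannon_entropy(data))  # 3.0 (eight distinct bytes, each seen once)
```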

Willem Van Onsem