2

I am working with datasets that include encrypted strings. Using pandas.unique() on the columns containing encrypted strings often returns fewer results than expected, especially when the strings are similar. After some experimenting, it seems that this is a problem with certain characters not being taken into consideration by the unique() function, but I can't figure out why that is. Interestingly, pandas.duplicated() works just fine where .unique() doesn't.

Example code which demonstrates the problem:

import pandas as pd

arr = [
b'''`\x95\x00\x95\xba\xe6M\xef\x9f\xdd\xc1\xe5\xc5\xd9\'\\e\x88O\xc4\\\xf7\xd2\xe8J8xJ\x84\x0c\xb1\x1b\xe5>Y\'\xd8\n\xf9U\xc0\xd5\xc8\xe8\xfc\xd5\xf8\xaf`1\xdc|$\x96\xdcin\x7f\x12\xa0\xbd\x03+_\x13o\xb1\x8a\x14\xc5\x0b\xb4\xa8\xfc\x12\xbd\xe3\x7fO\xa6\xf5\x89\xa6\xe2\xca\xa2\xccs\x890XP2\x87\x87\xa4M\xc1\xe8\x94\x83\xda5;4\xa0_\xff+\xdfj\xfe\xd5\xc9xo\xe2c\x95\x91N\x85\xbcG>_9\xabJ9\x9f?\xf0\xc6\xdfr\x943Qv\xf1\xcf\xd8\x92V8\x15\xdcO\x1a\xfd\xdd\xe2\x0ct\xe9z\xe8\'\xa1\xab7\xd8\xdf\xe5\x99\xa8\x19"(\r\x1f);\xae\x17''',
b'''`\x95\x00\x95\xba\xe6M\xef\x9f\xdd\xc1\xe5\xc5\xd9'\\e\x88O\xc4\\\xf7\xd2\xe8J8xJ\x84\x0c\xb1\x1b\xe5>Y'\xd8\n\xf9U\xc0\xd5\xc8\xe8\xfc\xd5\xf8\xaf`1\xdc|$\x96\xdcin\x7f\x12\xa0\xbd\x03+_\x13o\xb1\x8a\x14\xc5\x0b\xb4\xa8\xfc\x12\xbd\xe3\x7fO\xa6\xf5\x89\xa6\xe2\xca\xa2\xccs\x890XP2\x87\x87\xa4M\xc1\xe8\x94\x83\xda5;4\xa0_\xff+\xdfj\xfe\xd5\xc9xo\xe2c\x95\x91N\x85\xbcG>_9\xabJ9\x9f?\xf0\xc6\xdfr\x943Qv\xf1\xcf\xd8\x92\x8a:\xd1\x8a\xeb]\x91\x89\xdb@\x931\xff\xb1}A\xb1\xc5\x98f\r\x019\x9f\x04n\xe9k#\x84d\xc8'''
]

 # Build dataframe
df = pd.DataFrame(data=arr)

len_without_encoding = len(df[0].str.decode('latin1').unique())
len_with_encoding = len(df[0].str.decode('latin1').str.encode('latin1').unique())
duplicates_without_encoding = df[0].str.decode('latin1').duplicated().sum()
duplicates_with_encoding = df[0].str.decode('latin1').str.encode('latin1').duplicated().sum()

print(f'Number of unique entries without encoding = {len_without_encoding}.')
print(f'Number of unique entries with encoding = {len_with_encoding}.')
print(f'Number of duplicate entries without encoding = {duplicates_without_encoding}.')
print(f'Number of duplicate entries with encoding = {duplicates_with_encoding}.')

Code output:

Number of unique entries without encoding = 1.
Number of unique entries with encoding = 2.
Number of duplicate entries without encoding = 0.
Number of duplicate entries with encoding = 0.
nadrods
  • 21
  • 2
  • 1
    Which characters do you think it's ignoring? Can you come up with shorter examples? – Barmar Jun 17 '23 at 22:16
  • 1
    Note that `\'` is the same as `'`. – Barmar Jun 17 '23 at 22:17
  • I suspect it's multiple encodings that represent the same character. – Barmar Jun 17 '23 at 22:35
  • Why are you trying to handle this data as strings? It would seem bytes would be more appropriate here. – Carbon Jun 17 '23 at 22:45
  • 2
    @Carbon This is a problem I stumbled on after leaving an errant .decode() lying around - I just couldn't get to the bottom of it and wanted to know why it was happening. The data is stored as bytes up to the point of decryption. – nadrods Jun 17 '23 at 22:50

1 Answers1

1

Series.unique() uses a hash table to identify duplicates.

>>> pd.util.hash_array(df[0].str.decode('latin1').values)

array([16742319041731200657, 16742319041731200657], dtype=uint64)

>>> pd.util.hash_array(df[0].str.decode('latin1').str.encode('latin1').values)

array([14404729582210184291,  4252367250525150497], dtype=uint64)

It appears that the Pandas hashing algorithm for strings does not like nulls:

>>> pd.util.hash_array(pd.Series(['A\x00B', 'B\x00C', 'A\x00C', 'B\x00B']).values)

array([5348837498266033693, 5426228885139917247, 5348837498266033693, 5426228885139917247], dtype=uint64)
Emilio Silva
  • 1,942
  • 14
  • 17
  • 2
    The hash algorithm of a `str` (Unicode string) seems to only calculate up to a null in the string. `df = pd.DataFrame(data=[b'A\x00A', b'A\x00B', b'B\x00B']); print(df[0].str.decode('latin1').unique())` only finds two unique values. – Mark Tolonen Jun 17 '23 at 23:05
  • 1
    Or more directly: `pd.Series (['A\x00B', 'A\x00C']).unique()` returns `array(['A\x00B'], dtype=object)`. That's clearly wrong. – Mark Tolonen Jun 17 '23 at 23:12
  • In ye ancient times, a null in a string was the terminator https://stackoverflow.com/questions/16627588/what-is-the-correct-string-terminator-in-c – Carbon Jun 17 '23 at 23:13
  • This discussion seems to explain the problem. Am I right in thinking that [this particular .pyx file](https://github.com/pandas-dev/pandas/blob/0bc16da1e53e9a4d637cf81c770e6517da65a029/pandas/_libs/hashing.pyx) is causing strings to be terminated early? – nadrods Jun 18 '23 at 00:15