I am working with datasets that include encrypted strings. Using pandas.unique() on the columns containing encrypted strings often returns fewer results than expected, especially when the strings are similar. After some experimenting, it seems that this is a problem with certain characters not being taken into consideration by the unique() function, but I can't figure out why that is. Interestingly, pandas.duplicated() works just fine where .unique() doesn't.
Example code which demonstrates the problem:
import pandas as pd
arr = [
b'''`\x95\x00\x95\xba\xe6M\xef\x9f\xdd\xc1\xe5\xc5\xd9\'\\e\x88O\xc4\\\xf7\xd2\xe8J8xJ\x84\x0c\xb1\x1b\xe5>Y\'\xd8\n\xf9U\xc0\xd5\xc8\xe8\xfc\xd5\xf8\xaf`1\xdc|$\x96\xdcin\x7f\x12\xa0\xbd\x03+_\x13o\xb1\x8a\x14\xc5\x0b\xb4\xa8\xfc\x12\xbd\xe3\x7fO\xa6\xf5\x89\xa6\xe2\xca\xa2\xccs\x890XP2\x87\x87\xa4M\xc1\xe8\x94\x83\xda5;4\xa0_\xff+\xdfj\xfe\xd5\xc9xo\xe2c\x95\x91N\x85\xbcG>_9\xabJ9\x9f?\xf0\xc6\xdfr\x943Qv\xf1\xcf\xd8\x92V8\x15\xdcO\x1a\xfd\xdd\xe2\x0ct\xe9z\xe8\'\xa1\xab7\xd8\xdf\xe5\x99\xa8\x19"(\r\x1f);\xae\x17''',
b'''`\x95\x00\x95\xba\xe6M\xef\x9f\xdd\xc1\xe5\xc5\xd9'\\e\x88O\xc4\\\xf7\xd2\xe8J8xJ\x84\x0c\xb1\x1b\xe5>Y'\xd8\n\xf9U\xc0\xd5\xc8\xe8\xfc\xd5\xf8\xaf`1\xdc|$\x96\xdcin\x7f\x12\xa0\xbd\x03+_\x13o\xb1\x8a\x14\xc5\x0b\xb4\xa8\xfc\x12\xbd\xe3\x7fO\xa6\xf5\x89\xa6\xe2\xca\xa2\xccs\x890XP2\x87\x87\xa4M\xc1\xe8\x94\x83\xda5;4\xa0_\xff+\xdfj\xfe\xd5\xc9xo\xe2c\x95\x91N\x85\xbcG>_9\xabJ9\x9f?\xf0\xc6\xdfr\x943Qv\xf1\xcf\xd8\x92\x8a:\xd1\x8a\xeb]\x91\x89\xdb@\x931\xff\xb1}A\xb1\xc5\x98f\r\x019\x9f\x04n\xe9k#\x84d\xc8'''
]
# Build dataframe
df = pd.DataFrame(data=arr)
len_without_encoding = len(df[0].str.decode('latin1').unique())
len_with_encoding = len(df[0].str.decode('latin1').str.encode('latin1').unique())
duplicates_without_encoding = df[0].str.decode('latin1').duplicated().sum()
duplicates_with_encoding = df[0].str.decode('latin1').str.encode('latin1').duplicated().sum()
print(f'Number of unique entries without encoding = {len_without_encoding}.')
print(f'Number of unique entries with encoding = {len_with_encoding}.')
print(f'Number of duplicate entries without encoding = {duplicates_without_encoding}.')
print(f'Number of duplicate entries with encoding = {duplicates_with_encoding}.')
Code output:
Number of unique entries without encoding = 1.
Number of unique entries with encoding = 2.
Number of duplicate entries without encoding = 0.
Number of duplicate entries with encoding = 0.