Some time ago I asked this question about reverting an encoding done to feed the data to a machine learning model.
At the time the answer was enough, but now I need more. I have molar-mass data with noise, and I want to map it back to the element symbols of the different compositions. Here is an example of how it can be done:
import pandas as pd
import random as rand
#Create DataFrame
df = pd.DataFrame({'col1':['one','two','two','one', 0.151],'col2':[0.2,0.2,0.2,0.2,0.2],'col3':[0.3,0.3,0.3,0.3,0.3], 'col4':[0.4,0.4,0.4,0.4,0.4]})
print(df)
#Create Simple encoding and replace with it
encoding = {'one': 0.1, 'two': 0.2}
df.replace({'col1':encoding}, inplace=True)
print(df)
#Add some noise, like the one we could find in real life
df['col1'] = df['col1'] + [rand.randint(-100,100)/10000000 for _ in range(df.shape[0])]
print(df)
#Find out the closest encoding with the noise
ms = [abs(df['col1'] - encoding[i]) for i in encoding]
print(ms)
#Revert the encoding
enc = []
for row in range(len(ms[0])):
    m = min(ms[0][row], ms[1][row])
    if m == ms[0][row]:
        enc.append(list(encoding.keys())[0])
    elif m == ms[1][row]:
        enc.append(list(encoding.keys())[1])
    else:
        enc.append('NotEncoded')
print(enc)
#Assign the reverted encoding to the df
df['col1'] = enc
print(df)
The problem with this solution I've come up with is that it's hard-coded and doesn't scale to the real data (30+ keys in the dictionary, hundreds of rows to check), so I'm looking for an approach with less hard-coding and a more functional style.
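One direction that might remove the hard-coding is to build one distance column per encoding key and let pandas pick the nearest key with `idxmin`, so the loop and the per-key `if`/`elif` branches disappear entirely. A minimal sketch, assuming the same toy data and `encoding` dictionary as above:

```python
import pandas as pd
import random as rand

# Same setup as above: encode the labels, then add small noise
df = pd.DataFrame({'col1': ['one', 'two', 'two', 'one', 0.151]})
encoding = {'one': 0.1, 'two': 0.2}
df.replace({'col1': encoding}, inplace=True)
df['col1'] = df['col1'] + [rand.randint(-100, 100) / 10000000
                           for _ in range(df.shape[0])]

# One distance column per key: |noisy value - encoded value|
distances = pd.DataFrame({key: (df['col1'] - value).abs()
                          for key, value in encoding.items()})

# idxmin over the columns returns, per row, the key whose encoded
# value is closest -- this works for 2 keys or 30+
df['col1'] = distances.idxmin(axis=1)
print(df)
```

This should behave the same for any size of `encoding`, since the keys are only ever touched through the dictionary itself.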
Any help will be welcome. Thanks.