
Some time ago I asked this question about reverting an encoding that was applied to feed data to a machine learning model.

At the time the answer was enough, but now I need more. I have molar mass data with noise, and I want to map it back to the element symbols of the different compositions. Here is an example of how it can be done:


import pandas as pd
import random as rand

#Create DataFrame
df = pd.DataFrame({'col1': ['one', 'two', 'two', 'one', 0.151],
                   'col2': [0.2, 0.2, 0.2, 0.2, 0.2],
                   'col3': [0.3, 0.3, 0.3, 0.3, 0.3],
                   'col4': [0.4, 0.4, 0.4, 0.4, 0.4]})
print(df)

#Create Simple encoding and replace with it
encoding = {'one': 0.1, 'two': 0.2}
df.replace({'col1':encoding}, inplace=True)
print(df)

#Add some noise, like the one we could find in real life
df['col1'] = df['col1'] + [rand.randint(-100,100)/10000000 for _ in range(df.shape[0])]
print(df)

#Find out the closest encoding with the noise
ms = [abs(df['col1'] - encoding[i]) for i in encoding]
print(ms)

#Revert the encoding (hard-coded for the two keys of the example)
enc = []
for row in range(len(ms[0])):
    m = min(ms[0][row], ms[1][row])
    if m == ms[0][row]:
        enc.append(list(encoding.keys())[0])
    elif m == ms[1][row]:
        enc.append(list(encoding.keys())[1])
    else:
        enc.append('NotEncoded')  #never reached: m is always one of the two distances
print(enc)

#Assign the reverted encoding to the df
df['col1'] = enc

print(df)

The problem with this solution I've come up with is that it's hard-coded and doesn't translate well to the real data (30+ keys in the dictionary, hundreds of rows to check), so I'm looking for a less hard-coded, more functional approach to the problem.
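For what it's worth, a less hard-coded variant of the same nearest-match idea could look roughly like this (just a sketch that starts from the noisy numeric col1, i.e. before the final assignment above, and assumes the noise is always smaller than the gap between encoded values), but I suspect there is a cleaner, more pandas-native way:

#Sketch: build one distance column per key, then take the closest key per row
distances = pd.DataFrame({key: (df['col1'].astype(float) - value).abs()
                          for key, value in encoding.items()})
df['col1'] = distances.idxmin(axis=1)
print(df)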

Any help will be welcome,

Thanks.

1 Answer


Talking with some people on the Python Discord, we've come up with this solution (it starts from the noisy numeric col1, i.e. it replaces the brute-force revert from the question):

#Build a lookup frame: one row per encoding, with the label in 'source' and the value in 'mapped'
df_encoding = pd.Series(encoding, name='mapped').to_frame().reset_index().rename(columns={'index': 'source'})
print('Solution')

#merge_asof needs numeric keys and both frames sorted by their join columns
df['col1'] = df['col1'].astype(float)
df_rencoded = pd.merge_asof(df.sort_values('col1'), df_encoding.sort_values('mapped'),
                            left_on='col1', right_on='mapped', direction='nearest')
print(df_rencoded)

#Copy the recovered labels back into col1 and drop the helper columns
df_rencoded['col1'] = df_rencoded['source']
df_rencoded.drop(['source', 'mapped'], axis=1, inplace=True)
print(df_rencoded)

Here, the first step, df_encoding = pd.Series(encoding, name='mapped').to_frame().reset_index().rename(columns={'index': 'source'}), creates a Series from the encoding dictionary, turns it into a DataFrame (so it can be joined later) and resets the index so that the dictionary keys become a regular column. Finally, the rename() method changes that column's name from index to source.
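With the toy encoding from the question, the resulting lookup frame looks roughly like this:

print(df_encoding)
#  source  mapped
#0    one     0.1
#1    two     0.2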

Then we do the merge_asof. This requires both dataframes to be sorted by their respective join columns, and with direction='nearest' every row of df is matched to the encoding value closest to its noisy col1.
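One detail to keep in mind (not part of the original snippet) is that sorting by col1 reorders the rows. If the original row order matters, a sketch of one way to get it back is to carry the old index through the merge and sort on it afterwards:

#Remember the original row positions, merge, then restore the order
df_rencoded = pd.merge_asof(df.reset_index().sort_values('col1'), df_encoding,
                            left_on='col1', right_on='mapped', direction='nearest')
df_rencoded = df_rencoded.sort_values('index').set_index('index').rename_axis(None)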

Lastly, we copy the recovered labels from source back into col1 and drop the helper columns (source and mapped) that the merge added.
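With 30+ keys and possibly more than one encoded column in the real data, the same steps can be rolled into a small helper. This is only a sketch; the decode_nearest name and signature are made up for illustration, and it assumes the column already holds the noisy numeric values:

def decode_nearest(df, col, encoding):
    #Map the noisy numeric values in `col` back to the nearest encoding key
    lookup = (pd.Series(encoding, name='mapped')
                .to_frame()
                .reset_index()
                .rename(columns={'index': 'source'})
                .sort_values('mapped'))
    out = pd.merge_asof(df.reset_index().sort_values(col), lookup,
                        left_on=col, right_on='mapped', direction='nearest')
    out[col] = out['source']
    return (out.drop(columns=['source', 'mapped'])
               .sort_values('index').set_index('index').rename_axis(None))

#Usage with the example data
print(decode_nearest(df, 'col1', encoding))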