I have a function that hashes the int values in a pd.DataFrame column called some_id. My problem is that at the end I am forced to use pd.Series.apply in order to call md5(x).hexdigest() on each value. When apply(f) is called on a dataframe with a considerable number of rows, it causes serious performance problems (time + memory), because it makes one Python function call per row. On 10 million rows an OOM kill is inevitable. (I have tried.)
The following function does exactly what it should; the part to change is the one marked below the #TODO:
import pandas as pd
import numpy as np
from hashlib import md5
from pandas.api.types import is_numeric_dtype
def hash_some_id(data: pd.DataFrame) -> pd.Series:
    # Remove all NaN and 0 values in order to encode the rest
    # (replace already returns a new frame, so no separate copy is needed)
    data_to_encode = data.replace(0, np.nan)
    data_to_encode.dropna(inplace=True)
    if is_numeric_dtype(data_to_encode.some_id):
        # Convert the int some_id to str and prepend "0" to it
        data_to_encode["some_id"] = "0" + data_to_encode.some_id.astype(str)
    # Encode to binary
    data_to_encode["some_id"] = data_to_encode.some_id.str.encode('ascii')
    # TODO: Optimize this
    # This is inefficient as it is applied on each row of data.
    # Find a way to vectorize it
    encoded_data = data_to_encode.some_id.apply(lambda x: md5(x).hexdigest())
    return encoded_data
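
For context, md5 is not exposed as a NumPy ufunc, so as far as I can tell the digest itself cannot be truly vectorized; hashlib still has to run once per value. Below is a minimal sketch of one possible mitigation, assuming the exact md5 hex digests must be kept: iterating over the raw values in chunks with a plain list comprehension skips some of Series.apply's per-row overhead and bounds the intermediate allocations. hash_in_chunks and its chunk_size parameter are hypothetical names, not part of the function above.

import pandas as pd
from hashlib import md5

def hash_in_chunks(encoded_ids: pd.Series, chunk_size: int = 1_000_000) -> pd.Series:
    # encoded_ids holds the ascii-encoded bytes, i.e. the some_id column
    # right before the apply call above. chunk_size is an arbitrary knob.
    parts = []
    values = encoded_ids.to_numpy()
    for start in range(0, len(values), chunk_size):
        stop = start + chunk_size
        # A list comprehension avoids the per-row machinery of Series.apply;
        # md5 is still computed once per value (it cannot be vectorized).
        digests = [md5(v).hexdigest() for v in values[start:stop]]
        parts.append(pd.Series(digests, index=encoded_ids.index[start:stop]))
    return pd.concat(parts) if parts else pd.Series(dtype=object)

The line under the #TODO would then become encoded_data = hash_in_chunks(data_to_encode.some_id).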