I have a function that hashes the int values in a pd.DataFrame column called some_id. My problem is that at the end I am forced to use pd.Series.apply in order to call md5(x).hexdigest() on each value. When apply(f) is called on a dataframe with a considerable number of rows, it causes serious performance problems (time + memory), because it makes one Python function call per row. On 10 million rows an OOM kill is inevitable. (I have tried.)
The following function does exactly what it should; the part to change is the one marked below the #TODO:
import pandas as pd
import numpy as np
from hashlib import md5
from pandas.api.types import is_numeric_dtype
def hash_some_id(data: pd.DataFrame) -> pd.Series:
    # Remove all NaN and 0 values in order to encode the rest
    # (replace already returns a new frame, so no separate copy is needed)
    data_to_encode = data.replace(0, np.nan)
    data_to_encode.dropna(inplace=True)
    if is_numeric_dtype(data_to_encode.some_id):
        # Convert the int some_id to str and prepend "0" to it
        data_to_encode["some_id"] = "0" + data_to_encode.some_id.astype(str)
    # Encode to binary
    data_to_encode["some_id"] = data_to_encode.some_id.str.encode('ascii')
    # TODO: Optimize this
    # This is inefficient as it is applied on each row of data.
    # Find a way to vectorize it
    encoded_data = data_to_encode.some_id.apply(lambda x: md5(x).hexdigest())
    return encoded_data
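
For context, md5 is not exposed as a NumPy ufunc, so as far as I can tell the digest itself cannot be truly vectorized; hashlib still has to run once per value. Below is a minimal sketch of one possible mitigation, assuming the exact md5 hex digests must be kept: iterating over the raw values in chunks with a plain list comprehension skips some of Series.apply's per-row overhead and bounds the intermediate allocations. hash_in_chunks and its chunk_size parameter are hypothetical names, not part of the function above.

import pandas as pd
from hashlib import md5

def hash_in_chunks(encoded_ids: pd.Series, chunk_size: int = 1_000_000) -> pd.Series:
    # encoded_ids holds the ascii-encoded bytes, i.e. the some_id column
    # right before the apply call above. chunk_size is an arbitrary knob.
    parts = []
    values = encoded_ids.to_numpy()
    for start in range(0, len(values), chunk_size):
        stop = start + chunk_size
        # A list comprehension avoids the per-row machinery of Series.apply;
        # md5 is still computed once per value (it cannot be vectorized).
        digests = [md5(v).hexdigest() for v in values[start:stop]]
        parts.append(pd.Series(digests, index=encoded_ids.index[start:stop]))
    return pd.concat(parts) if parts else pd.Series(dtype=object)

The line under the #TODO would then become encoded_data = hash_in_chunks(data_to_encode.some_id).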