
I know that working on NumPy arrays can be quicker than working on pandas DataFrames.

I am wondering if there is an equivalent (and quicker) way to do pandas.replace on a NumPy array.

In the example below, I have created a dataframe and a dictionary. The dictionary contains the column names and their corresponding mappings. Is there any function that would allow me to feed a dictionary to a NumPy array to do the mapping and yield a quicker processing time?

import pandas as pd
import numpy as np

# Dataframe
d = {'col1': [1, 2, 3], 'col2': [4, 5, 6]}
df = pd.DataFrame(data=d)

# dictionary I want to map
d_mapping = {'col1' : {1:2 , 2:1} ,  'col2' : {4:1}}

# result using pandas replace
print(df.replace(d_mapping))
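# for reference, this should print:
#    col1  col2
# 0     2     1
# 1     1     5
# 2     3     6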

# Instead of a pandas dataframe, I want to perform the same operation on a numpy array
df_np = df.to_records(index=False)
– Daves
  • Have a look at: https://stackoverflow.com/questions/16992713/translate-every-element-in-numpy-array-according-to-key – Anurag Dabas May 24 '21 at 01:42
  • [related](https://stackoverflow.com/questions/57484396/vectorizing-a-pure-function-with-numpy-assuming-many-duplicates). My intuition is that beating pandas with numpy will be difficult. – hilberts_drinking_problem May 24 '21 at 02:26
  • @AnuragDabas Thanks! I did have a look at that, but that scenario applies the same dictionary to the entire matrix. For mine, I would like to have a different dictionary for each column – Daves May 24 '21 at 09:26
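
For reference, the searchsorted-based trick from the question linked in the comments can also be adapted so that each column gets its own dictionary. Below is a minimal sketch of that idea (the function name replace_per_column and its exact signature are illustrative only, not taken from the thread):

import numpy as np
import pandas as pd

def replace_per_column(df, d_mapping):
    out = df.copy()
    for name, mapping in d_mapping.items():
        col = out[name].to_numpy()
        keys = np.array(sorted(mapping))             # sorted keys, required by searchsorted
        vals = np.array([mapping[k] for k in keys])  # replacement values aligned with keys
        idx = np.clip(np.searchsorted(keys, col), 0, len(keys) - 1)
        hit = keys[idx] == col                       # True where the element actually is a key
        out[name] = np.where(hit, vals[idx], col)    # keep the original value where there is no match
    return out

# same example as above
df = pd.DataFrame({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
d_mapping = {'col1': {1: 2, 2: 1}, 'col2': {4: 1}}
print(replace_per_column(df, d_mapping))

Whether this beats df.replace will depend on the data size and the number of keys per column.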

1 Answer


You can try np.select(). I believe its performance advantage depends on the number of unique elements to replace.

def replace_values(df, d_mapping):
    def replace_col(col):
        # extract numpy array and column name from pd.Series
        col, name = col.values, col.name
        # generate condlist and choicelist
        # for every key in mapping create a boolean mask
        condlist = [col == x for x in d_mapping[name].keys()]
        choicelist = d_mapping[name].values()
        # np.select keeps the existing value (the default, col) where no condition matches
        return np.select(condlist, choicelist, col)

    return df.apply(replace_col)

Usage:

replace_values(df, d_mapping)
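
For the example dataframe above, this should return the same result as df.replace(d_mapping):

   col1  col2
0     2     1
1     1     5
2     3     6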

I also believe that you can speed up the code above if you use lists/arrays in the mapping instead of dicts, replacing the keys() and values() calls (dict lookups are also relatively expensive) with index lookups:

d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
...
m = d_mapping[name]
condlist = [col == x for x in m[0]]
choicelist = m[1]
...

np.isin(col, m[0]) could also come in handy if you only need a single mask of the values that will be replaced. The benchmark below uses this list-based mapping.

Update:

Here is a benchmark:

import pandas as pd
import numpy as np

# Dataframe
df = pd.DataFrame({"col1": [1, 2, 3], "col2": [4, 5, 6]})

# dictionary I want to map
d_mapping = {"col1": [[1, 2], [2, 1]], "col2": [[4], [1]]}
d_mapping_2 = {
    col: dict(zip(*replacement)) for col, replacement in d_mapping.items()
}


def replace_values(df, mapping):
    def replace_col(col):
        col, (m0, m1) = col.values, mapping[col.name]
        return np.select([col == x for x in m0], m1, col)

    return df.apply(replace_col)


from timeit import timeit

print("np.select: ", timeit(lambda: replace_values(df, d_mapping), number=5000))
print("df.replace: ", timeit(lambda: df.replace(d_mapping_2), number=5000))

On my 6-year-old laptop it prints:

np.select:  3.6562702230003197
df.replace:  4.714512745998945

np.select is ~20% faster here.

– Alexander Volkovsky