I have a dictionary of dictionaries, where each inner dictionary holds the value mapping for one column of my dataframe.

My goal is to find the most efficient way to perform this mapping on a dataframe with 1 row and 300 columns.

My dataframe's values are randomly sampled from range(mapping_size), and each dictionary maps values from range(mapping_size) to random.randint(mapping_size+1, mapping_size*2).

I can see from the answer provided by jpp that map is probably the most efficient of the standard pandas options, but I am looking for something even faster than map. Can you think of anything? I am also happy for the input to use a data structure other than a pandas dataframe.

Here is the code for setting up the question and results using map and replace:

# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300  # number of columns
nrow = 1  # number of rows
mapping_size = 10  # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}

random.seed(123)

for idx1 in range(ncol):
    # create an empty dictionary for this column
    mapping_dict['col_' + str(idx1)] = {}
    for idx2 in range(mapping_size):
        # keys run from 1 to mapping_size and map to random.randint(mapping_size+1, mapping_size*2);
        # note that 0, which the data below can contain, gets no mapping, hence the fillna variant later
        mapping_dict['col_' + str(idx1)][idx2 + 1] = random.randint(mapping_size + 1, mapping_size * 2)
        
# create a dataframe with values sampled from range(mapping_size)
d = {}

np.random.seed(123)  # seed NumPy's generator, since np.random.choice is used below

for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
    
df = pd.DataFrame(data=d)

Results using map and replace:

%%timeit -n 20
df.replace(mapping_dict)  # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])  # 181 ms
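
For reference, one direction I am considering (a sketch only, not benchmarked carefully) is to drop the dict of dicts entirely: since every key is a small integer, the 300 mappings can be packed into a single 2D NumPy lookup table and applied in one fancy-indexing step. The names lookup, mapped and df_mapped below are mine; rows are initialised to the identity so unmapped values such as 0 pass through, mimicking the fillna variant above:

# pack the dict of dicts into a (ncol, mapping_size + 1) integer lookup table;
# row i holds the mapping for column i, initialised to the identity
lookup = np.tile(np.arange(mapping_size + 1), (ncol, 1))
for i in range(ncol):
    for k, v in mapping_dict['col_' + str(i)].items():
        lookup[i, k] = v

# apply all 300 column mappings at once with fancy indexing
mapped = lookup[np.arange(ncol), df.values]  # shape (nrow, ncol)
df_mapped = pd.DataFrame(mapped, columns=df.columns)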
Daves

1 Answer

Just use pandas, without a Python-level loop over the columns.

# runtime ~ 1 s (1000 rows)

# create a mapping Series with a MultiIndex of (column, value) pairs
df_dict = pd.DataFrame(mapping_dict)
obj_dict = df_dict.T.stack()

# obj_dict:
# col_0  1    10
#        2    14
#        3    11
# ...
# Length: 3000, dtype: int64

# build the (column, value) index pairs for df; df can have more than 1 row
obj_idx = pd.Series(df.values.flatten())
obj_idx.index = pd.Index(df.columns.to_list() * df.shape[0])
idx = obj_idx.to_frame().reset_index().set_index(['index', 0]).index
# use reindex so that missing (column, value) pairs give NaN;
# plain obj_dict[idx] raises a KeyError on missing labels in pandas >= 1.0
result = obj_dict.reindex(idx)

# handle null values: fall back to the original value, taken from the second element of each index tuple
cond = result.isnull()
result[cond] = pd.Series(result[cond].index.values).str[1].values

# reshape back into the result DataFrame
df_result = pd.DataFrame(result.values.reshape(df.shape))
df_result.columns = df.columns

df_result
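
To verify, a quick sanity check against the per-column map + fillna baseline from the question (assuming the df and mapping_dict defined there; the expected variable is mine):

# compare against the map + fillna loop from the question
expected = df.copy()
for key in mapping_dict:
    expected[key] = expected[key].map(mapping_dict[key]).fillna(expected[key])
assert (df_result.astype(float).values == expected.astype(float).values).all()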
Ferris
  • Thanks for getting back! I got an error from `result = obj_dict[idx]` saying `Passing list-likes to .loc or [] with any missing labels is no longer supported.` – Daves Dec 09 '20 at 10:49
  • Also, I tried running the `map` loop with 1000 rows and it took 287 ms, so much faster than the ~1 s here. – Daves Dec 09 '20 at 10:51