I have a dictionary of dictionaries, one per column of my dataframe, where each inner dictionary holds the mapping for that column. My goal is to find the most efficient way to apply these mappings to a dataframe with 1 row and 300 columns. The dataframe values are randomly sampled from range(mapping_size), and each dictionary maps the keys in range(mapping_size) to random values drawn with random.randint(mapping_size+1, mapping_size*2).

I can see from the answer provided by jpp that map is probably the most efficient option, but I am looking for something even faster than map. Can you think of anything? I am also happy for the input to use a data structure other than a pandas dataframe.
Here is the code to set up the question, along with the results using map and replace:
# import packages
import random
import pandas as pd
import numpy as np
import timeit

# specify parameters
ncol = 300          # number of columns
nrow = 1            # number of rows
mapping_size = 10   # length of each dictionary

# create a dictionary of dictionaries for mapping
mapping_dict = {}
random.seed(123)
for idx1 in range(ncol):
    # each column gets a dictionary of length mapping_size that maps the keys
    # in range(mapping_size) to random.randint(mapping_size+1, mapping_size*2)
    mapping_dict['col_' + str(idx1)] = {}
    for idx2 in range(mapping_size):
        mapping_dict['col_' + str(idx1)][idx2] = random.randint(mapping_size+1, mapping_size*2)

# create a dataframe with values sampled from range(mapping_size)
d = {}
np.random.seed(123)  # np.random.choice is driven by numpy's seed, not the random module's
for idx1 in range(ncol):
    d['col_' + str(idx1)] = np.random.choice(range(mapping_size), nrow)
df = pd.DataFrame(data=d)
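To make the setup concrete, here is a quick sanity check (the exact mapped values depend on the seed):

print(df.shape)                                # (1, 300)
print(list(mapping_dict['col_0'].items())[:3]) # first few key/value pairs of one mapping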
Results using map and replace:
%%timeit -n 20
df.replace(mapping_dict)  # 296 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key]).fillna(df[key])  # 221 ms

%%timeit -n 20
for key in mapping_dict.keys():
    df[key] = df[key].map(mapping_dict[key])  # 181 ms