I have a Pandas DataFrame where I need to add new columns of data from lookup Dictionaries. I am looking for the fastest way to do this. I have a way that works using DataFrame.map()
with a lambda but I wanted to know if this was the best practice and best performance I could achieve. I am used to doing with work with R
and the excellent data.table
library. I am working in a Jupyter notebook which is what is letting me use %time
on the final line.
Here is what I have:
import numpy as np
import pandas as pd
np.random.seed(123)
num_samples = 100_000_000
ids = np.arange(0, num_samples)
states = ['Oregon', 'Michigan']
cities = ['Portland', 'Detroit']
state_data = {
0:{'Name': 'Oregon', 'mean': 100, 'std_dev': 5},
1:{'Name': 'Michigan', 'mean':90, 'std_dev': 8}
}
city_data = {
0:{'Name': 'Portland', 'mean': 8, 'std_dev':3},
1:{'Name': 'Detroit','mean': 4, 'std_dev':3}
}
state_df = pd.DataFrame.from_dict(state_data,orient='index')
print(state_df)
city_df = pd.DataFrame.from_dict(city_data,orient='index')
print(city_df)
sample_df = pd.DataFrame({'id':ids})
sample_df['state_id'] = np.random.randint(0, 2, num_samples)
sample_df['city_id'] = np.random.randint(0, 2, num_samples)
%time sample_df['state_mean'] = sample_df['state_id'].map(state_data).map(lambda x : x['mean'])
The last line is what I am most focused on.
I have also tried the following but saw no significant performance difference:
%time sample_df['state_mean'] = sample_df['state_id'].map(lambda x : state_data[x]['mean'])
What I ultimately want is to get sample_df
to have columns for each of the states and cities. So I would have the following columns in the table:
id | state | state_mean | state_std_dev | city | city_mean | city_std_dev