I have a Pandas DataFrame where I need to add new columns of data from lookup dictionaries. I am looking for the fastest way to do this. I have a way that works using DataFrame.map() with a lambda, but I wanted to know whether this is best practice and the best performance I can achieve. I am used to doing this kind of work in R with the excellent data.table library. I am working in a Jupyter notebook, which is what lets me use %time on the final line.

Here is what I have:

import numpy as np
import pandas as pd

np.random.seed(123)
num_samples = 100_000_000

ids = np.arange(0, num_samples)
states = ['Oregon', 'Michigan']
cities = ['Portland', 'Detroit']

state_data = {
    0:{'Name': 'Oregon', 'mean': 100, 'std_dev': 5},
    1:{'Name': 'Michigan', 'mean':90, 'std_dev': 8}
}

city_data = {
    0:{'Name': 'Portland', 'mean': 8, 'std_dev':3},
    1:{'Name': 'Detroit','mean': 4, 'std_dev':3}
}

state_df = pd.DataFrame.from_dict(state_data,orient='index')
print(state_df)

city_df = pd.DataFrame.from_dict(city_data,orient='index')
print(city_df)

sample_df = pd.DataFrame({'id':ids})
sample_df['state_id'] = np.random.randint(0, 2, num_samples)
sample_df['city_id'] = np.random.randint(0, 2, num_samples)
%time sample_df['state_mean'] = sample_df['state_id'].map(state_data).map(lambda x : x['mean'])

The last line is what I am most focused on.

I have also tried the following but saw no significant performance difference:

%time sample_df['state_mean'] = sample_df['state_id'].map(lambda x : state_data[x]['mean'])

What I ultimately want is to get sample_df to have columns for each of the states and cities. So I would have the following columns in the table:

id | state | state_mean | state_std_dev | city | city_mean | city_std_dev
halfer
Matthew Crews
1 Answer

Use DataFrame.join if you want to add all the columns:

sample_df = sample_df.join(state_df, on='state_id')


#        id  state_id  city_id      Name  mean  std_dev
#0        0         0        0    Oregon   100        5
#1        1         1        1  Michigan    90        8
#2        2         0        0    Oregon   100        5
#3        3         0        0    Oregon   100        5
#4        4         0        0    Oregon   100        5
#...    ...       ...      ...       ...   ...      ...
#9995  9995         1        0  Michigan    90        8
#9996  9996         1        1  Michigan    90        8
#9997  9997         0        1    Oregon   100        5
#9998  9998         1        1  Michigan    90        8
#9999  9999         1        0  Michigan    90        8

For one column:

sample_df['state_mean'] = sample_df['state_id'].map(state_df['mean'])
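
A sketch extending this one-column map to all the columns the question asks for. The loop over lookup columns and the `rename` to `state`/`city` are my additions, not part of the original answer; a smaller `n` is used so the example runs quickly:

```python
import numpy as np
import pandas as pd

np.random.seed(123)
n = 10  # small sample for illustration; the question uses 100_000_000

state_df = pd.DataFrame.from_dict({
    0: {'Name': 'Oregon', 'mean': 100, 'std_dev': 5},
    1: {'Name': 'Michigan', 'mean': 90, 'std_dev': 8},
}, orient='index')
city_df = pd.DataFrame.from_dict({
    0: {'Name': 'Portland', 'mean': 8, 'std_dev': 3},
    1: {'Name': 'Detroit', 'mean': 4, 'std_dev': 3},
}, orient='index')

sample_df = pd.DataFrame({'id': np.arange(n)})
sample_df['state_id'] = np.random.randint(0, 2, n)
sample_df['city_id'] = np.random.randint(0, 2, n)

# Series.map with a Series aligns the values of the left side
# against the index of the lookup Series, one lookup column at a time.
for col in ['Name', 'mean', 'std_dev']:
    sample_df[f'state_{col}'] = sample_df['state_id'].map(state_df[col])
    sample_df[f'city_{col}'] = sample_df['city_id'].map(city_df[col])

# Rename the Name columns to match the desired output schema.
sample_df = sample_df.rename(columns={'state_Name': 'state', 'city_Name': 'city'})
print(sorted(sample_df.columns))
```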
ansev
  • What is interesting is that I'm seeing that it is way faster to do it one column at a time, as in my original question, instead of using `join`. I'm curious as to why that is. Perhaps not as much memory alloc? – Matthew Crews Feb 13 '20 at 19:33
  • using map? I don't think so – ansev Feb 13 '20 at 19:37
  • Um, yes? This takes 5 sec on my machine `sample_df['state_mean'] = sample_df['state_id'].map(state_df['mean'])` `sample_df['state_std_dev'] = sample_df['state_id'].map(state_df['std_dev'])` This takes 12.3 sec: `sample_df = sample_df.join(state_df,on = 'state_id', rsuffix='state')` – Matthew Crews Feb 13 '20 at 19:41
  • 1
    I don't know what will happen with more columns and that they don't contain only 1 and 0, but I guess join must be optimized for that, anyway the optimal way depends on the application you need :) – ansev Feb 13 '20 at 19:44
  • 1
    Could you explain what is going on in the `map` function? I'm slightly confused. Does it just know to use the index of the `state_df` to find the corresponding row? – Matthew Crews Feb 13 '20 at 19:54
  • 1
    Exactly we can map a series with a dictionary but also with another series. In the latter case, the values ​​of one series are mapped using the index of the other series. – ansev Feb 13 '20 at 19:56
  • 1
    when we use `DataFrame.join` we do the same really only that we can join another dataframe (i.e. more columns). We can also use `DataFrame.merge` to match two DataFrame based on the columns (and not by the index as with `join`) https://stackoverflow.com/questions/53645882/pandas-merging-101 – ansev Feb 13 '20 at 20:00