4

I have written some code that replaces values in a DataFrame with values from another frame using a dictionary, and it is working, but i am using this on some large files, where the dictionary can get very long. A few thousand pairs. When I then uses this code it runs very slow, and it have also been going out of memory on a few ocations.

I am somewhat convinced that my method of doing this is far from optimal, and that there must be some faster ways to do this. I have created a simple example that does what I want, but that is slow for large amounts of data. Hope someone have a simpler way to do this.

import pandas as pd

#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42,    51, 23, 14, 111, 134]})

#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 'name' : ['id1',   'id2', 'id3', 'id4', 'id5', 'id6', 'id7', 'id8', 'id9', 'id10']})

#My current "slow" way of doing this.

#Starts by creating a dictionary from df2
#Need to create dictionaries from the domain and banners tables to link ids
df2_dict = dict(zip(df2['id'], df2['name']))

#and then uses the dict to replace the ids with name in df1
df1.replace({'id' : df2_dict}, inplace=True)
Siesta
  • 451
  • 7
  • 21

1 Answers1

1

I think you can use map with Series converted to_dict - get NaN if not exist value in df2:

df1['id'] = df1.id.map(df2.set_index('id')['name'].to_dict())
print (df1)
    id  values
0  id1      12
1  id2      32
2  id3      42
3  id4      51
4  id5      23
5  id3      14
6  id5     111
7  id9     134

Or replace, if dont exist value in df2 let original values from df1:

df1['id'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
    id  values
0  id1      12
1  id2      32
2  id3      42
3  id4      51
4  id5      23
5  id3      14
6  id5     111
7  id9     134

Sample:

#Frame with data where I want to replace the 'id' with the name from df2
df1 = pd.DataFrame({'id' : [1, 2, 3, 4, 5, 3, 5, 9], 'values' : [12, 32, 42,    51, 23, 14, 111, 134]})
print (df1)
#Frame containing names linked to ids
df2 = pd.DataFrame({'id' : [1, 2, 3, 4, 6, 7, 8, 9, 10], 'name' : ['id1',   'id2', 'id3', 'id4', 'id6', 'id7', 'id8', 'id9', 'id10']})
print (df2)

df1['new_map'] = df1.id.map(df2.set_index('id')['name'].to_dict())
df1['new_replace'] = df1.id.replace(df2.set_index('id')['name'])
print (df1)
   id  values new_map new_replace
0   1      12     id1         id1
1   2      32     id2         id2
2   3      42     id3         id3
3   4      51     id4         id4
4   5      23     NaN           5
5   3      14     id3         id3
6   5     111     NaN           5
7   9     134     id9         id9
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • This seems to be working. But is there a way to keep the 'values' column in df1. I dont seem to be able to figure out how to write this to just change the id column and keep the values column. Nvm, just figured it out. Could just do this: df1['id'].replace(df2.set_index('id')['name'], inplace=True) – Siesta Nov 10 '16 at 13:34
  • Sorry, I dont add asign, please see update my answer. – jezrael Nov 10 '16 at 13:37