Faster way to query & compute in Pandas

Question

I have two dataframes in Pandas. What I want achieve is, grab every 'Name' from DF1 and get the corresponding 'City' and 'State' present in DF2.

For example, 'Dwight' from DF1 should return corresponding values 'Miami' and 'Florida' from DF2.

DF1

         Name     Age  Student
0        Dwight   20   Yes
1        Michael  30   No
2        Pam      55   No
.  .        .    .
70000    Jim      27   Yes

DF1 has approx 70,000 rows with 3 columns

Second Dataframe, DF2 has approx 320,000 rows.

         Name     City       State
0        Dwight   Miami      Florida
1        Michael  Scranton   Pennsylvania
2        Pam      Austin     Texas
.  .        .    .           .
325082    Jim      Scranton   Pennsylvania

Currently I have two functions, which return the values of 'City' and 'State' using a filter.

def read_city(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['City'].values[0])
    else:
        field = ""
    return field


def read_state(id):
    filt = (df2['Name'] == id)
    if filt.any():
        field = (df2[filt]['State'].values[0])
    else:
        field = ""
    return field

I am using the apply function to process all the values.

df['city_list'] = df['Name'].apply(read_city)
df['State_list'] = df['Name'].apply(read_state)

The result takes a long time to compute in the above way. It roughly takes me around 18 minutes to get back the df['city_list'] and df['State_list'].

Is there a faster to compute this ? Since I am completely new to pandas, I would like to know if there is a efficient way to compute this ?

You could just merge the two if you are trying to get all the information in one place for a given name: `pd.merge(df1, df2, how='left', on=['Name'])` — noah, Oct 15 '20 at 18:32

score 0 · Answer 1 · answered Oct 15 '20 at 18:34

0

I believe you can do a map:

s = df2.groupby('name')[['City','State']].agg(list)
df['city_list'] = df['Name'].map(s['City'])
df['State_list'] = df['Name'].map(s['State'])

Or a left merge after you got s:

df = df.merge(s.add_suffix('_list'), left_on='Name', right_index=True, how='left')

answered Oct 15 '20 at 18:34

Quang Hoang

146,074
10
56
74

This also worked great, thank you @Quang – Tushar Oct 19 '20 at 08:12

score 0 · Answer 2 · answered Oct 16 '20 at 02:40

I think you can do something like this:

# Dataframe DF1 (dummy data)

DF1 = pd.DataFrame(columns=['Name', 'Age', 'Student'], data=[['Dwight', 20, 'Yes'], ['Michael', 30, 'No'], ['Pam', 55, 'No'], ['Jim', 27, 'Yes']])

print("DataFrame DF1")
print(DF1)

# Dataframe DF2 (dummy data)

DF2 = pd.DataFrame(columns=['Name', 'City', 'State'], data=[['Dwight', 'Miami', 'Florida'], ['Michael', 'Scranton', 'Pennsylvania'], ['Pam', 'Austin', 'Texas'], ['Jim', 'Scranton', 'Pennsylvania']])

print("DataFrame DF2")
print(DF2)

# You do a merge on 'Name' column and then, you change the name of columns 'City' and 'State'
df = pd.merge(DF1, DF2, on=['Name']).rename(columns={'City': 'city_list', 'State': 'State_list'})
print("DataFrame final")
print(df)

Output:

DataFrame DF1
Name       Age  Student
0   Dwight  20  Yes
1   Michael 30  No
2   Pam     55  No
3   Jim     27  Yes

DataFrame DF2
Name        City       State
0   Dwight  Miami       Florida
1   Michael Scranton    Pennsylvania
2   Pam     Austin      Texas
3   Jim     Scranton    Pennsylvania

DataFrame final
Name       Age  Student city_list   State_list
0   Dwight  20  Yes     Miami       Florida
1   Michael 30  No      Scranton    Pennsylvania
2   Pam     55  No      Austin      Texas
3   Jim     27  Yes     Scranton    Pennsylvania

Faster way to query & compute in Pandas

2 Answers2