
I would like to know the pandas alternative to an Excel VLOOKUP when it involves two dataframes and a one-to-many relationship. I have searched for this answer and don't think I found one that solves my use case. Below are a few posts that are relevant, but not exactly what I need.

Excel VLOOKUP equivalent in pandas

vlookup between 2 Pandas dataframes

Situation:

I have two data frames and a key that links both (site). I would use the pandas merge function, but I don't want multiple records returned for one key (location B in this example).

Specifically, I want to return the status of a site from the loc_status dataframe if it exists. If the location has both a status of "inactive" and "active" then I only want to return "active".

Here is a basic example:

loc_status = [{'site':'A', 'status':'ACTIVE'}, {'site':'B', 'status':'ACTIVE'}, {'site':'B', 'status':'INACTIVE'}, {'site':'C', 'status':'INACTIVE'}]

loc = [{'site':'A'}, {'site':'B'}, {'site':'C'}, {'site':'D'}]

df_status = pd.DataFrame(loc_status)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | B     | INACTIVE |
| 3  | C     | INACTIVE |
+----+-------+----------+

df_loc = pd.DataFrame(loc)

+----+------+
|    | site |
+----+------+
| 0  | A    |
| 1  | B    |
| 2  | C    |
| 3  | D    |
+----+------+

result = [{'site':'A', 'status':'ACTIVE'}, {'site':'B', 'status':'ACTIVE'}, {'site':'C', 'status':'INACTIVE'}, {'site':'D'}]

df_result = pd.DataFrame(result)

+----+-------+----------+
|    | site  |  status  |
+----+-------+----------+
| 0  | A     | ACTIVE   |
| 1  | B     | ACTIVE   |
| 2  | C     | INACTIVE |
| 3  | D     | NaN      |
+----+-------+----------+
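For reference, here is a minimal sketch of why a plain left merge alone doesn't give me this result: it keeps both rows for site B instead of preferring the ACTIVE one.

```python
import pandas as pd

loc_status = [{'site': 'A', 'status': 'ACTIVE'}, {'site': 'B', 'status': 'ACTIVE'},
              {'site': 'B', 'status': 'INACTIVE'}, {'site': 'C', 'status': 'INACTIVE'}]
loc = [{'site': 'A'}, {'site': 'B'}, {'site': 'C'}, {'site': 'D'}]

df_status = pd.DataFrame(loc_status)
df_loc = pd.DataFrame(loc)

# A plain left merge duplicates site B -- 5 rows come back instead of 4
merged = pd.merge(df_loc, df_status, on='site', how='left')
print(merged)
```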

Thank you.

Benjamin Neil

3 Answers


First sort_values by both columns, so that for each site with duplicate statuses, ACTIVE always comes first. Then drop_duplicates with the default keep='first' keeps only the ACTIVE row when there are duplicates.

Last, map by a Series created with set_index:

df_status = df_status.sort_values(['site','status']).drop_duplicates(['site'])
print (df_status)
  site    status
0    A    ACTIVE
1    B    ACTIVE
3    C  INACTIVE

df_loc['status'] = df_loc['site'].map(df_status.set_index('site')['status'])
print (df_loc)
  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN
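Note this works because 'ACTIVE' happens to sort before 'INACTIVE' alphabetically. If the preferred status didn't sort first, a sketch using an ordered Categorical would do (the 'OPEN'/'CLOSED' labels here are hypothetical, not from the question):

```python
import pandas as pd

df_status = pd.DataFrame({'site': ['A', 'B', 'B', 'C'],
                          'status': ['OPEN', 'OPEN', 'CLOSED', 'CLOSED']})

# 'CLOSED' sorts before 'OPEN' alphabetically, so state the preference explicitly
order = ['OPEN', 'CLOSED']  # preferred status first
df_status['status'] = pd.Categorical(df_status['status'],
                                     categories=order, ordered=True)

# Categorical sorting follows the declared order, so 'OPEN' wins for site B
best = df_status.sort_values(['site', 'status']).drop_duplicates(['site'])
print(best)
```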

Timings:

#jezrael solution
In [136]: %timeit df_loc['status_jez'] = df_loc['site'].map(df_status.sort_values(['site','status']).drop_duplicates(['site']).set_index('site')['status'])
10 loops, best of 3: 67.3 ms per loop

#Allen solution
In [137]: %timeit pd.merge(df_loc,df_status.sort_values(['site','status']).groupby(by='site').first().reset_index(),how='left')
10 loops, best of 3: 114 ms per loop

#piRSquared solution
In [138]: %timeit df_loc.assign(status_pir=df_loc.site.map(df_status.loc[(df_status.status == 'ACTIVE').groupby(df_status.site).idxmax()].set_index('site').status))
1 loop, best of 3: 3.37 s per loop

Code for timings:

np.random.seed(123)
N = 100000
L = np.random.randint(100000,size=N)
df_status = pd.DataFrame({'site': np.random.choice(L, N),
                         'status':np.random.choice(['ACTIVE','INACTIVE'],N)})
print (df_status.head(10))

df_loc = pd.DataFrame({'site':L})
print (df_loc.head(10))
jezrael
# remove INACTIVE rows when the same site also has an ACTIVE row.
df_status = df_status.sort_values(['site','status']).groupby(by='site').first().reset_index()
# left-join the loc and status dataframes.
pd.merge(df_loc,df_status,how='left')

Out[108]: 
  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN
Allen Qin
  • Define the status series of boolean values indicating whether status == 'ACTIVE'
  • Define the site series for convenience
  • By grouping status by site and calculating idxmax, I find the first index that has a status of ACTIVE, if it has one at all
  • With idx I can slice df_status down to unique values of site, with a preference for ACTIVE
  • By setting the index and taking the status column, I've created a pd.Series that will act dict-like for the purposes of map
  • Use assign + map to create the new column

status = df_status.status == 'ACTIVE'
site = df_status.site
idx = status.groupby(site).idxmax()
m = df_status.loc[idx].set_index('site').status

df_loc.assign(status=df_loc.site.map(m))

  site    status
0    A    ACTIVE
1    B    ACTIVE
2    C  INACTIVE
3    D       NaN
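The key step above is that idxmax on a boolean Series returns the index label of the first True, falling back to the first element when everything is False. A minimal illustration (toy Series, not from the answer):

```python
import pandas as pd

# idxmax returns the label of the first maximum value;
# for booleans, that is the first True
s = pd.Series([False, True, True], index=[10, 11, 12])
print(s.idxmax())  # first True is at label 11

# when all values are False, the first label wins
t = pd.Series([False, False], index=[3, 4])
print(t.idxmax())  # label 3
```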
piRSquared