
In simple terms, I'm trying to add columns latitude and longitude from df1 to a smaller DataFrame called df2 by comparing the values from their air_id and hpg_id columns:

[image: the two DataFrames, df1 and df2]

The trick to adding latitude and longitude to df2 lies in how the comparison against df1 is made, which can be one of 3 cases:

  • When there is a match between df2.air_id AND df1.air_id;
  • When there is a match between df2.hpg_id AND df1.hpg_id;
  • When there is a match between both of them: [df2.air_id, df2.hpg_id] AND [df1.air_id, df1.hpg_id];

With that in mind, the expected result should be:

[image: the expected resulting DataFrame]

Notice how the ignore_me column from df1 was left out of the resulting DataFrame.

Here is the code to set up the DataFrames:

data = { 'air_id'     : [ 'air1',     '', 'air3', 'air4', 'air2', 'air1' ], 
         'hpg_id'     : [ 'hpg1', 'hpg2',     '', 'hpg4',     '',     '' ], 
         'latitude'   : [  101.1,  102.2,    103,    104,    102, 101.1, ],
         'longitude'  : [     51,     52,     53,     54,     52,    51, ],
         'ignore_me'  : [     91,     92,     93,     94,     95,    96 ] }

df1 = pd.DataFrame(data)
display(df1)


data2 = { 'air_id'    : [     '',  'air2', 'air3', 'air1' ], 
          'hpg_id'    : [ 'hpg1',  'hpg2',     '',     '' ]  }

df2 = pd.DataFrame(data2)
display(df2)

Unfortunately, I'm failing to use merge() for this task. My current result is a DataFrame with all columns from df1 mostly filled with NaNs:

[image: merge result, mostly NaNs]
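For reference, a naive single merge on both key columns (a reconstruction of the failing approach, not the exact original code) reproduces this mostly-NaN result:

```python
import pandas as pd

df1 = pd.DataFrame({
    'air_id':    ['air1', '', 'air3', 'air4', 'air2', 'air1'],
    'hpg_id':    ['hpg1', 'hpg2', '', 'hpg4', '', ''],
    'latitude':  [101.1, 102.2, 103, 104, 102, 101.1],
    'longitude': [51, 52, 53, 54, 52, 51],
    'ignore_me': [91, 92, 93, 94, 95, 96]})

df2 = pd.DataFrame({
    'air_id': ['', 'air2', 'air3', 'air1'],
    'hpg_id': ['hpg1', 'hpg2', '', '']})

# Merging on both id columns at once only matches rows where BOTH
# air_id and hpg_id are identical (empty strings included), so rows
# that should match on a single id come back with NaN coordinates.
attempt = df2.merge(df1, on=['air_id', 'hpg_id'], how='left')
print(attempt)
```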

How can I copy these specific columns from df1 using the rules above?

karlphillip
  • Is `hpg3` supposed to be in the `air_id` column? Is that a typo? I feel like your result doesn't make sense unless it's in the `hpg_id` column or am I missing something? – pault Mar 16 '18 at 22:29
  • It's a typo! Thanks for pointing that out. – karlphillip Mar 17 '18 at 00:15

3 Answers


Using sets and NumPy broadcasting to handle the matching... sprinkled with fairy dust.

ids = ['air_id', 'hpg_id']
cols = ['latitude', 'longitude']

# Callable for .loc: keeps only rows with a non-empty string
def true(s): return s.astype(bool)

# One set of non-empty ids per row of each DataFrame
s2 = df2.stack().loc[true].groupby(level=0).apply(set)
s1 = df1[ids].stack().loc[true].groupby(level=0).apply(set)

# Pairwise set intersections; a non-empty intersection means a match
i, j = np.where((s1.values & s2.values[:, None]).astype(bool))

a = np.zeros((len(df2), 2), int)
a[i, :] = df1[cols].values[j]

df2.join(pd.DataFrame(a, df2.index, cols))

  air_id hpg_id  latitude  longitude
0          hpg1       101         51
1   air2   hpg2       102         52
2   hpg3              103         53

Details

s2 looks like this

0          {hpg1}
1    {air2, hpg2}
2          {hpg3}
dtype: object

And s1

0    {air1, hpg1}
1          {hpg2}
2          {hpg3}
3    {air4, hpg4}
4          {air2}
dtype: object

The point is to find whether anything in a row of df2 matches anything in a row of df1. Now I can use broadcasting and &

s1.values & s2.values[:, None]

array([[{'hpg1'}, set(), set(), set(), set()],
       [set(), {'hpg2'}, set(), set(), {'air2'}],
       [set(), set(), {'hpg3'}, set(), set()]], dtype=object)

But empty sets evaluate to False in a boolean context so

(s1.values & s2.values[:, None]).astype(bool)

array([[ True, False, False, False, False],
       [False,  True, False, False,  True],
       [False, False,  True, False, False]], dtype=bool)

And now I can use np.where to show me where these Trues are

i, j = np.where((s1.values & s2.values[:, None]).astype(bool))

print(i, j)

[0 1 1 2] [0 1 4 2]

Those are the matching rows from df2 and df1 respectively. But I don't need row 1 twice, so I create an empty array of the appropriate size with the expectation that I'll overwrite row 1. I fill these values with the lats and lons from df1

a = np.zeros((len(df2), 2), int)
a[i, :] = df1[cols].values[j]

a

array([[101,  51],
       [102,  52],
       [103,  53]])

I then wrap this in a pd.DataFrame and join as we see above.

piRSquared
  • For the rounding, use dtype float when creating `a`. `a = np.zeros((len(df2), 2))` – piRSquared Mar 17 '18 at 01:20
  • I ended up selecting another answer as the official problem solver because of its faster performance and robustness. Your solution took ~3min to process `df1` with **2343828** rows and `df2` with **10** rows. Nevertheless, I hope people can appreciate the great explanation you gave for this approach! I find it very educational and helpful. Thank you! – karlphillip Mar 18 '18 at 00:14

Here is one way to accomplish what you're trying to do.

First, use merge() twice: once on air_id, then on hpg_id. In both merges, ignore the trivial case where the key is an empty string.

result = df2\
    .merge(
        df1[df1['air_id']!=''].drop(['hpg_id'], axis=1), on=['air_id'], how='left'
    )\
    .merge(
        df1[df1['hpg_id']!=''].drop(['air_id'], axis=1), on=['hpg_id'], how='left'
    )

print(result)
#  air_id hpg_id  ignore_me_x  latitude_x  longitude_x  ignore_me_y  \
#0          hpg1          NaN         NaN          NaN           91   
#1   air2   hpg2         92.0       102.0         52.0           92   
#2          hpg3          NaN         NaN          NaN           93   
#
#   latitude_y  longitude_y  
#0         101           51  
#1         102           52  
#2         103           53 

However, this creates duplicates for the columns you want. (I drop the other join key on each call to merge to avoid duplicate column names for those.)

We can coalesce these values by adapting one of the methods described in this post.

cols = ['latitude', 'longitude']
colsx = list(map(lambda c: c+"_x", cols))  # list needed for python3
colsy = list(map(lambda c: c+"_y", cols))  # list needed for python3
result[cols] = pd.DataFrame(
    np.where(result[colsx].isnull(), result[colsy], result[colsx])
)
result = result[['air_id', 'hpg_id'] + cols]
print(result)
#  air_id hpg_id  latitude  longitude
#0          hpg1     101.0       51.0
#1   air2   hpg2     102.0       52.0
#2   air3            103.0       53.0

UPDATE

In the case where the merge produces duplicate entries, you can use pandas.DataFrame.drop_duplicates().

result = result.drop_duplicates()
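As an editorial aside (not part of the original answer), the `_x`/`_y` coalescing step can also be expressed with `Series.combine_first`, which keeps the caller's value where it is non-null and falls back to the other Series. A minimal sketch with stand-in columns:

```python
import numpy as np
import pandas as pd

# Stand-ins for the merged latitude_x / latitude_y columns
lat_x = pd.Series([np.nan, 102.0, np.nan])
lat_y = pd.Series([101.0, 102.0, 103.0])

# Keep lat_x where present, otherwise fall back to lat_y
latitude = lat_x.combine_first(lat_y)
print(latitude.tolist())  # [101.0, 102.0, 103.0]
```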
pault
  • @karlphillip are you using python3? If so you have to wrap the calls to `map` with `list` (I updated the answer). Here are working examples on repl.it: [python2 example](https://repl.it/repls/FuzzyObviousInstructions) and [python3 example](https://repl.it/repls/WeirdLinenSoftwareengineer). – pault Mar 17 '18 at 02:36
  • Great! How does it compare against the other answers in terms of performance? – pault Mar 17 '18 at 04:06
  • If you can update your question with an example case where it doesn't work, I can try to find a way to modify it. – pault Mar 17 '18 at 04:18
  • @karlphillip A simple fix for this example would be to just add `result = result.drop_duplicates()` at the end. Let me know if that solves your larger issue and I'll update my answer. – pault Mar 17 '18 at 16:14
  • Fantastic solution! In extreme cases where there is more than one entry with the same `air_id` or `hpg_id` and the occurrences have different latitude/longitude coordinates (for some weird reason), it adds all the occurrences of that id (with different lat/lon coordinates) to the result table, which is absolutely great! In terms of performance, it took only 16s to process `df1` with **2343828** rows and `df2` with 10 rows. It's pretty incredible, congratulations! – karlphillip Mar 18 '18 at 00:09

This is one manual way, without merges. It isn't efficient, but it may be manageable if it performs adequately for your use case.

df1['lat_long'] = list(zip(df1['latitude'], df1['longitude']))

air = df1[df1['air_id'] != ''].set_index('air_id')['lat_long']
hpg = df1[df1['hpg_id'] != ''].set_index('hpg_id')['lat_long']

def mapper(row):
    myair, myhpg = row['air_id'], row['hpg_id']
    if (myair != '') and (myair in air):
        return air.get(myair)
    elif (myhpg != '') and (myhpg in hpg):
        return hpg.get(myhpg)
    elif (myair != '') and (myair in hpg):
        return hpg.get(myair)
    elif (myhpg != '') and (myhpg in air):
        return air.get(myhpg)
    else:
        return (None, None)

df2['lat_long'] = df2.apply(mapper, axis=1)
df2[['latitude', 'longitude']] = df2['lat_long'].apply(pd.Series)
df2 = df2.drop('lat_long', axis=1)

#   air_id hpg_id  latitude  longitude
# 0          hpg1       101         51
# 1   air2   hpg2       102         52
# 2   hpg3              103         53
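A side note on the last expansion step (an editorial addition, not part of the original answer): `Series.apply(pd.Series)` builds one DataFrame row per element and is slow on large inputs; constructing a DataFrame from the list of tuples is typically much faster:

```python
import pandas as pd

# Stand-in for df2['lat_long'] after the mapper step
lat_long = pd.Series([(101, 51), (102, 52), (103, 53)])

# Faster equivalent of lat_long.apply(pd.Series)
coords = pd.DataFrame(lat_long.tolist(), index=lat_long.index,
                      columns=['latitude', 'longitude'])
print(coords)
```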
jpp