
Simply put, I'm trying to compare values from 2 columns of the first DataFrame with the same columns in another DataFrame. The indexes of the matched rows are stored as a new column in the first DataFrame.

Let me explain: I'm working with geographical features (latitude/longitude) and the main DataFrame -- called df -- has something like 55M observations, which look a little bit like this:

[image: df with raw pickup_latitude/pickup_longitude observations]

As you can see, there are only two rows with data that looks legit (indexes 2 and 4).

The second DataFrame -- called legit_df -- is much, much smaller and has all the geographical data I consider legit:

[image: legit_df with the known-good coordinate pairs]

Without getting into WHY, the main task involves comparing each latitude/longitude observation from df to the data of legit_df. When there's a successful match, the index of legit_df is copied to a new column of df, resulting in df looking like this:

[image: df with the new column holding the matched legit_df indexes]

The value -1 is used to show when there wasn't a successful match. In the example above, the only observations that were valid are the ones at indexes 2 and 4, which found their matches at indexes 1 and 2 in legit_df.
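
To make this concrete, here is a small sketch using the coordinates from the runnable example below (the values are illustrative):

legit_df:
   pickup_latitude  pickup_longitude
0        41.366138        -73.137393
1        40.190564        -74.689831
2        40.769413        -73.863300

df after matching:
   pickup_latitude  pickup_longitude  legit
0        10.000000         11.000000     -1
1        20.000000         21.000000     -1
2        40.190564        -74.689831      1
3        30.000000         31.000000     -1
4        40.769413        -73.863300      2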

My current approach to solve this problem uses .apply(). Yes, it's slow, but I couldn't find a way to vectorize the function below or use Cython to speed it up:

def getLegitLocationIndex(lat, long):
    # Return the index of the first legit_df row whose coordinates match exactly,
    # or -1 when there is no match.
    idx = legit_df.index[(legit_df['pickup_latitude'] == lat) & (legit_df['pickup_longitude'] == long)].tolist()
    if not idx:
        return -1
    return idx[0]

df['legit'] = df.apply(lambda row: getLegitLocationIndex(row['pickup_latitude'], row['pickup_longitude']), axis=1)

Since this code is remarkably slow on a DataFrame with 55M observations, my question is: is there a faster way to solve this problem?

I'm sharing a Short, Self Contained, Correct (Compilable) Example to help-you-help-me come up with a faster alternative:

import pandas as pd
import numpy as np

data1 = { 'pickup_latitude'  : [41.366138,   40.190564,  40.769413],
          'pickup_longitude' : [-73.137393, -74.689831, -73.863300]
        }

legit_df = pd.DataFrame(data1)
display(legit_df)

####################################################################################

observations = 10000
lat_numbers = [41.366138,   40.190564,  40.769413, 10, 20, 30, 50, 60, 80, 90, 100]
lon_numbers = [-73.137393, -74.689831, -73.863300, 11, 21, 31, 51, 61, 81, 91, 101]

# Generate 10000 random indexes between 0 and 10 (randint's upper bound is exclusive)
random_idx = np.random.randint(low=0, high=len(lat_numbers), size=observations)
lat_data = []
lon_data = []

# Build 10000 pairs of geographical coordinates by sampling from the lists above
for i in range(observations):
    lat_data.append(lat_numbers[random_idx[i]])
    lon_data.append(lon_numbers[random_idx[i]])

df = pd.DataFrame({ 'pickup_latitude' : lat_data, 'pickup_longitude': lon_data })
display(df.head())

####################################################################################

def getLegitLocationIndex(lat, long):
    # Return the index of the first legit_df row whose coordinates match exactly,
    # or -1 when there is no match.
    idx = legit_df.index[(legit_df['pickup_latitude'] == lat) & (legit_df['pickup_longitude'] == long)].tolist()
    if not idx:
        return -1
    return idx[0]


df['legit'] = df.apply(lambda row: getLegitLocationIndex(row['pickup_latitude'], row['pickup_longitude']), axis=1)
display(df.head())

The example above creates df with only 10k observations, which takes about 7 seconds to run on my machine. With 100k observations, it takes ~67 seconds. Now imagine my suffering when I have to process 55M rows...

karlphillip
  • What geographical regions are you observing in legit_df? If everything is to the left of Greenwich, your longitudes will be negative and you just discard everything that isn't. – cs95 Sep 25 '18 at 06:24
  • @coldspeed Thanks, but the only thing I'm interested in is *speeding up* this code. I only created the fake locations so it would be easier to understand the need of the second DataFrame. – karlphillip Sep 25 '18 at 06:27
  • I understand that, but sometimes "speeding up code" means understanding what the code is trying to do, the data it is working with, and any relevant domain knowledge. So, if you want to up your chances of having your code sped up, I highly recommend answering any questions that should come up. – cs95 Sep 25 '18 at 06:29
  • @coldspeed I get why you made that comment, thank you. This a completely fictional scenario that was only created to help me get some help in finding a faster approach to handle a similar task. – karlphillip Sep 25 '18 at 06:32
  • `df.merge(legit_df.reset_index(), on=['pickup_latitude', 'pickup_longitude'], how='left').fillna(-1)` ? – Chris Adams Sep 25 '18 at 06:40
  • @karlphillip can't reproduce that with the given example, I'm afraid. This may help..? https://stackoverflow.com/questions/43196907/valueerror-wrong-number-of-items-passed-meaning-and-suggestions – Chris Adams Sep 25 '18 at 06:59
  • @ChrisA I stored the result of your merge into `df['legit']`, silly me. – karlphillip Sep 25 '18 at 07:02

2 Answers


I think you can speed this up significantly using a merge instead of the current logic:

full_df = df.merge(legit_df.reset_index(), how="left", on=["pickup_longitude", "pickup_latitude"])

This resets the index of the reference table to make it a column and joins on longitude and latitude.
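
For reference, `legit_df.reset_index()` just promotes the index to a regular column named `index`, so with the question's data the right-hand table of the merge looks like:

   index  pickup_latitude  pickup_longitude
0      0        41.366138        -73.137393
1      1        40.190564        -74.689831
2      2        40.769413        -73.863300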

full_df = full_df.rename(columns={"index": "legit"})
full_df["legit"] = full_df["legit"].fillna(-1).astype(int)

This renames the merged-in `index` column to the name you were after and fills any missing values (rows with no match) with -1.
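
As a quick sanity check on a tiny, deterministic frame (hypothetical values, reusing the question's `legit_df`):

tiny = pd.DataFrame({'pickup_latitude': [41.366138, 10.0],
                     'pickup_longitude': [-73.137393, 11.0]})
out = tiny.merge(legit_df.reset_index(), how="left",
                 on=["pickup_longitude", "pickup_latitude"])
out = out.rename(columns={"index": "legit"})
out["legit"] = out["legit"].fillna(-1).astype(int)
# out["legit"] is now [0, -1]: the first pair matches row 0 of legit_df,
# the second pair has no match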

Benchmarks:

Old approach: 5.18 s ± 171 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

New approach: 23.2 ms ± 1.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Sven Harris

You could use `DataFrame.merge` with `how='left'` on the common keys, resetting the index of `legit_df` first so it becomes a column that survives the merge.

Then fillna with -1:

df.merge(legit_df.reset_index(), on=['pickup_latitude', 'pickup_longitude'], how='left').fillna(-1)
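
This returns the whole merged frame. If you just want the matched indexes back on `df` as a `legit` column (the assignment that came up in the comments above), here's a sketch, assuming the coordinate pairs in `legit_df` are unique so the left merge preserves df's row count and order:

merged = df.merge(legit_df.reset_index(), on=['pickup_latitude', 'pickup_longitude'], how='left')
# 'index' holds the matched legit_df index (NaN where there was no match)
df['legit'] = merged['index'].fillna(-1).astype(int)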

Testing performance:

%%timeit
df['legit'] = df.apply(lambda row: getLegitLocationIndex(row['pickup_latitude'], row['pickup_longitude']), axis=1)

5.81 s ± 179 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%%timeit
(df.merge(legit_df.reset_index(),on=['pickup_latitude', 'pickup_longitude'], how='left').fillna(-1))

6.27 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Chris Adams