Simply put, I'm trying to compare values from 2 columns of the first DataFrame with the same columns in another DataFrame. The indexes of the matched rows are stored as a new column in the first DataFrame.
Let me explain: I'm working with geographical features (latitude/longitude) and the main DataFrame -- called df -- has something like 55M observations, which look a little bit like this:
As you can see, there are only two rows with data that looks legit (indexes 2 and 4).
The second DataFrame -- called legit_df -- is much, much smaller and has all the geographical data I consider legit:
Without getting into WHY, the main task involves comparing each latitude/longitude observation from df to the data of legit_df. When there's a successful match, the index of legit_df is copied to a new column of df, resulting in df looking like this:
The value -1 is used to show when there wasn't a successful match. In the example above, the only observations that were valid are the ones at indexes 2 and 4, which found their matches at indexes 1 and 2 in legit_df.
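In case the tables above don't come through as images, here's a small mock-up of the structure in code (the legit coordinates are the same ones I use in the runnable example at the bottom; the other values and the 5-row size are made up purely for illustration):

import pandas as pd

# legit_df: the small table of coordinates I consider legit
legit_df = pd.DataFrame({'pickup_latitude': [41.366138, 40.190564, 40.769413],
                         'pickup_longitude': [-73.137393, -74.689831, -73.863300]})

# df: a tiny stand-in for the 55M-row DataFrame -- only rows 2 and 4 hold legit coordinates
df = pd.DataFrame({'pickup_latitude': [10.0, 20.0, 40.190564, 30.0, 40.769413],
                   'pickup_longitude': [11.0, 21.0, -74.689831, 31.0, -73.863300]})

# Desired result: df gains a 'legit' column holding the matching legit_df index, or -1
#    pickup_latitude  pickup_longitude  legit
# 0        10.000000         11.000000     -1
# 1        20.000000         21.000000     -1
# 2        40.190564        -74.689831      1
# 3        30.000000         31.000000     -1
# 4        40.769413        -73.863300      2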
My current approach to solve this problem uses .apply(). Yes, it's slow, but I couldn't find a way to vectorize the function below or use Cython to speed it up:
def getLegitLocationIndex(lat, long):
    idx = legit_df.index[(legit_df['pickup_latitude'] == lat) & (legit_df['pickup_longitude'] == long)].tolist()
    if not idx:
        return -1
    return idx[0]

df['legit'] = df.apply(lambda row: getLegitLocationIndex(row['pickup_latitude'], row['pickup_longitude']), axis=1)
Since this code is remarkably slow on a DataFrame with 55M observations, my question is: is there a faster way to solve this problem?
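The only vectorized direction that has occurred to me is a plain left merge on both columns (keeping legit_df's index as a regular column and filling non-matches with -1), but I haven't checked whether it's actually correct or whether it holds up at 55M rows, so treat this as a rough sketch rather than something I know works:

# Rough sketch, not verified: left-merge df against legit_df on both coordinate
# columns, carrying legit_df's index along as the 'legit' column.
# Assumes legit_df has no duplicate coordinate pairs; relies on exact float
# equality, same as the .apply() version above.
merged = df.merge(legit_df.reset_index().rename(columns={'index': 'legit'}),
                  on=['pickup_latitude', 'pickup_longitude'],
                  how='left')
merged['legit'] = merged['legit'].fillna(-1).astype(int)

If something along these lines is the right idea (or there's something even faster), pointers would be very welcome.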
I'm sharing a Short, Self Contained, Correct (Compilable) Example to help you help me come up with a faster alternative:
import pandas as pd
import numpy as np
data1 = {'pickup_latitude': [41.366138, 40.190564, 40.769413],
         'pickup_longitude': [-73.137393, -74.689831, -73.863300]}
legit_df = pd.DataFrame(data1)
display(legit_df)
####################################################################################
observations = 10000
lat_numbers = [41.366138, 40.190564, 40.769413, 10, 20, 30, 50, 60, 80, 90, 100]
lon_numbers = [-73.137393, -74.689831, -73.863300, 11, 21, 31, 51, 61, 81, 91, 101]
# Generate 10000 random integers between 0 and 10 (indexes into the lists above)
random_idx = np.random.randint(low=0, high=len(lat_numbers), size=observations)
lat_data = []
lon_data = []
# Create a DataFrame to store 10000 pairs of geographical coordinates
for i in range(observations):
    lat_data.append(lat_numbers[random_idx[i]])
    lon_data.append(lon_numbers[random_idx[i]])
df = pd.DataFrame({ 'pickup_latitude' : lat_data, 'pickup_longitude': lon_data })
display(df.head())
####################################################################################
def getLegitLocationIndex(lat, long):
    idx = legit_df.index[(legit_df['pickup_latitude'] == lat) & (legit_df['pickup_longitude'] == long)].tolist()
    if not idx:
        return -1
    return idx[0]

df['legit'] = df.apply(lambda row: getLegitLocationIndex(row['pickup_latitude'], row['pickup_longitude']), axis=1)
display(df.head())
The example above creates df with only 10k observations, which takes about 7 seconds to run on my machine. With 100k observations, it takes ~67 seconds. Now imagine my suffering when I have to process 55M rows...