
I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like this:

Observation_id    Latitude     Longitude    Class_id
10131188          45.146973      6.416794        101
10799362          46.783695     -2.072855        700
10392536          48.604866     -2.825003       1456
...               ...            ...             ...
22068176          29.806055    -98.41853        5532

There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.

I need to find, for each observation, the distance to the closest observation of every class.

For example, for the first observation (which belongs to class 101 if we look at the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value will represent the distance in km to the closest occurrence of class 2, and so on until the last value, which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.

If the distance is too large (let's say more than 50 km), I don't need the exact distance but a fixed value (50 km in this case). So if the closest occurrence of a class is more than 50 km away from my observation (whether it's 51 km or 9,000 km), I can fill in 50 for the corresponding value of the observation's vector.
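
To make the encoding concrete, here is a toy sketch (my own illustration: 3 classes instead of 18,000, made-up distances, and class ids assumed to run from 1 upward) of what one row of the output holds:

import numpy as np
import pandas as pd

dist_km = pd.Series([12.3, 80.0, 3.1, 47.9])    # distances from one observation to 4 other rows
class_id = pd.Series([2, 1, 2, 3])              # class_id of each of those rows

vec = np.full(3, 50.0)                          # default value: "50 km or farther"
nearest = dist_km.groupby(class_id).min()       # closest occurrence per class
vec[nearest.index - 1] = nearest.clip(upper=50) # class ids assumed to start at 1
print(vec)                                      # [50.   3.1  47.9]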

But I see two problems here:

  • My code will take forever to run.
  • The created file will be huge.

I started to create a small script that calculates the haversine distance, but for one observation it takes around 8 seconds to run, so it would be impossible for 2 million. Here it is anyway:

from math import radians, sin, cos, asin, sqrt
import numpy as np

lat1 = radians(45.705116) # lat for observation 10561949, converted to radians once
lon1 = radians(1.424622)  # lon for observation 10561949, converted to radians once
df = df[df.observation_id != 10561949] # removing observation 10561949 from the DataFrame

list_obs = np.full(18000, 50.0) # Array of size 18,000 filled with the value 50

for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
  lat2, lon2 = radians(lat2), radians(lon2) # convert to radians
  a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1)/2)**2 # Haversine distance (1/2)
  dist = 2 * asin(sqrt(a)) * 6371 # Haversine distance (2/2), in km
  if dist < list_obs[class_id - 1]: # class ids assumed to run from 1 to 18,000
    list_obs[class_id - 1] = dist

Do you have an idea of how to speed up the algorithm (the distance doesn't have to be perfectly calculated, I just need a rough idea of the nearest neighbor of each class for each observation) and how to store the gigantic file afterwards (it will be an array-like of 2,000,000 x 18,000)?
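
For scale, here is a rough sketch of the storage arithmetic (my own estimate; the uint8 idea and the file name are assumptions based on the 50 km cap described above, not part of the original requirements):

import numpy as np

n_obs, n_classes = 2_000_000, 18_000
for dtype, nbytes in [('float64', 8), ('float32', 4), ('uint8', 1)]:
    print(dtype, n_obs * n_classes * nbytes / 1e9, 'GB')   # 288.0 GB, 144.0 GB, 36.0 GB

# Since every value is capped at 50, whole kilometres fit in a uint8 (0..50),
# and a memory-mapped file could be filled one observation (row) at a time, e.g.:
# out = np.memmap('distances.u8', dtype=np.uint8, mode='w+', shape=(n_obs, n_classes))
# out[i] = np.rint(np.minimum(list_obs, 50)).astype(np.uint8)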

The idea after this is to try to feed this to a Neural Network (let's say an MLP), to see the difference with a simple K-Nearest Neighbor.

  • Though I don't understand a few details of what you are doing, it seems to me like you would benefit from binning your dataset first into discrete units of 50 km; then, when you are checking distances, only check your current bin and the surrounding 8 bins. It avoids computing the distance to objects too far away and dramatically reduces distance computation (commonly used in simulations of, say, fluids) – Richard Boyne Jun 07 '22 at 09:28
  • Hello @RichardBoyne and thanks for your answer. Don't hesitate to tell me what wasn't clear so I can edit my post. As you can see in the table, my dataset doesn't contain distances but latitudes and longitudes, so I can't bin it into discrete units of 50 km (since the distance will change according to which observations we are working with). Or maybe I didn't understand something. As a reminder, the goal is, for **each** row of the DataFrame, to find the distance of the nearest neighbor of **each** of the 18,000 classes (or simply put 50 if the distance is larger than 50 km). – César Leblanc Jun 07 '22 at 09:38
  • Don't know how evenly your data is distributed along latitude and longitude. But an offset of 1 degree of latitude is approx. 111 kilometers. So based only on comparing latitudes you can probably filter out a lot of points. – Fedor Petrov Jun 08 '22 at 08:07
  • Hello @FedorPetrov and thanks for your answer. The data is distributed in two countries (France and the USA). So the latitude goes between 24 and 51 (between 41 and 51 for French observations and between 24 and 49 for American observations). But it's a good idea, I'll try to remove points that have a difference in latitude of 1 or more, and points that have a difference in longitude of 1/cos(latitude) or more (according to some basic calculations, I can simply ignore points that have a difference in longitude of 1 or more; see the sketch after these comments). Thanks again, I'll check if it can reduce the time of calculations. – César Leblanc Jun 08 '22 at 15:00
  • Just two countries, and they're far away from the poles? That really should be in the question. – Kelly Bundy Jun 08 '22 at 15:20
  • Thanks for the remark @KellyBundy, you're right I added it to the original question. – César Leblanc Jun 09 '22 at 07:38
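
A minimal sketch of the coordinate prefilter discussed in the comments above (df, lat1 and lon1 as in the question; the 1-degree thresholds are only safe because all points lie between latitudes 24 and 51, where one degree of latitude or longitude covers more than 50 km):

# keep only rows within 1 degree of (lat1, lon1) in both coordinates;
# at these latitudes this never discards a point that is within 50 km
mask = (df['latitude'].sub(lat1).abs() < 1.0) & (df['longitude'].sub(lon1).abs() < 1.0)
candidates = df[mask]   # only these rows need a haversine distance computed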

1 Answer


Since you only care about distances < 50 km, the best saving I can think of is to pre-compute (approximately) a grid position for each point, which removes the need to compute distances to far-away points.

Below is my best attempt at solving this. It has a setup complexity of O(len(df)), but a per-observation search complexity of just O(9 * average bin size), which is significantly less than the O(len(df)) of your example.

Note 1) there are large parts of this that can be vectorized to improve performance.

Note 2) there are most certainly better ways to bin distances on a sphere, I am just not that familiar with them; the key idea is to first index the values so that you can quickly find all data points within distance x.

Note 3) I would be surprised if this code is bug free.


# generate dummy data  --------------------------
import pandas as pd
import random

random.seed(10)
rand_float = lambda :(random.random()-.5)*90*2
rand_int = lambda :int(random.random()*18000)
dummy_data = [(rand_float(), rand_float(), rand_int()) for i in range(100_000)]
df = pd.DataFrame(data=dummy_data, columns=('lat', 'lon', 'class'))


# bin data points -------------------------------
from collections import defaultdict
from math import cos, radians

def find_bin(lat, lon, bin_size=60):
    """Approximately bin data points
    https://stackoverflow.com/questions/1253499/simple-calculations-for-working-with-lat-lon-and-km-distance
    approximate distance conversion to exclude "far away" distances - only needs to be approximate since we add buffer,
    however I am sure there are much better methods one could use to do this approximation, I spent 10 mins googling
    """
    # ~111 km per degree of latitude; ~111*cos(lat) km per degree of longitude
    return int(111 * lat // bin_size), int(111 * cos(radians(lat)) * lon // bin_size)

bins = defaultdict(list)
for i, row in df.iterrows():  # O(len(df))
    bins[find_bin(row['lat'], row['lon'])].append(int(i))  # this is slow; it can be vectorized less elegantly, but it only needs to run once
    
print(f'average bin size {sum(map(len, bins.values()))/len(bins)}')


# find distances to neighbours ------------------
from math import radians, sin, cos, asin, sqrt

def compute_distance(lon1, lat1, lon2, lat2):
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2]) # convert to radians
    a = sin((lat2 - lat1)/2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1 )/2)**2 # Haversine distance (1/2)
    return 2 * asin(sqrt(a)) * 6371 # Haversine distance (2/2)

def neighbours(x, y):
    yield x-1, y-1
    yield x-1, y
    yield x-1, y+1
    yield x  , y-1
    yield x  , y
    yield x  , y+1
    yield x+1, y-1
    yield x+1, y
    yield x+1, y+1
    
import numpy as np

obs = 1000  #  10561949
lat1, lon1, _ = df.iloc[obs].values
print(f'finding {lat1}, {lon1}')
b = find_bin(lat1, lon1)

list_obs = np.full(18000, 50.0) # Array of size 18,000 filled with the value 50

for adj_bin in neighbours(*b):  # O(9)
    for i in bins[adj_bin]:  # O(avg. bin size)
        if i == obs:
            continue  # skip the observation itself
        lat2, lon2, class_ = df.loc[i].values
        dist = compute_distance(lon1, lat1, lon2, lat2)  # argument order matches the signature above
        if dist < list_obs[int(class_)]:
            print(f'dist {dist} to {i} ({class_})')
            list_obs[int(class_)] = dist
– Richard Boyne
  • Hello @RichardBoyne and thank you for your answer. Indeed, it seems to be a good idea to separate the world into different bins, and only calculate the distance between observations that are in bins next to our current observation. However, this takes **way more time** (my "solution" was running in 6 seconds, yours is running in 6 minutes). There are far fewer "distance calculations" (around 150,000 instead of 2 million), but I think the double loop is making us lose a lot of time... Since I have to calculate this for each one of the observations, it's just not feasible this way... But thanks! – César Leblanc Jun 08 '22 at 14:35
  • hey @CésarLeblanc, yes I am sure this code is not performant due to the way the for loops are written and how that will be slicing memory in different parts of the array. It can likely be vectorized better; perhaps aggregating all the indices and then slicing the array once would be a good first step. But these are more common cases that can be looked up on Stack Overflow more easily. Good luck! – Richard Boyne Jun 09 '22 at 07:37
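
For completeness, a hedged sketch of the vectorization described in the last comment: gather the indices from the 9 neighbouring bins, slice the DataFrame once, and compute all haversine distances with NumPy. It reuses find_bin, neighbours, bins and the dummy df from the answer's code, assumes 0-based class ids as in the dummy data, and is untested on the real dataset.

import numpy as np

def distances_for(lat1, lon1, df, bins, max_km=50.0, n_classes=18_000):
    """Vectorized variant of the answer's inner loops: one slice, one array computation."""
    # collect all row indices from the bin of (lat1, lon1) and its 8 neighbours
    idx = [i for adj in neighbours(*find_bin(lat1, lon1)) for i in bins[adj]]
    sub = df.iloc[idx]

    lat2 = np.radians(sub['lat'].to_numpy())
    lon2 = np.radians(sub['lon'].to_numpy())
    rlat1, rlon1 = np.radians(lat1), np.radians(lon1)

    # haversine distance in km, computed for all candidate rows at once
    a = np.sin((lat2 - rlat1) / 2)**2 + np.cos(rlat1) * np.cos(lat2) * np.sin((lon2 - rlon1) / 2)**2
    dist = 2 * np.arcsin(np.sqrt(a)) * 6371

    out = np.full(n_classes, max_km)            # default: 50 km for classes with no nearby occurrence
    cls = sub['class'].to_numpy().astype(int)
    np.minimum.at(out, cls, dist)               # per-class minimum, written into out in place
    return out

list_obs = distances_for(lat1, lon1, df, bins)  # equivalent to the double loop above, in one pass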