I have a large dataset of around 2 million rows and 4 columns: observation_id, latitude, longitude and class_id, like this:
| observation_id | latitude | longitude | class_id |
|---|---|---|---|
| 10131188 | 45.146973 | 6.416794 | 101 |
| 10799362 | 46.783695 | -2.072855 | 700 |
| 10392536 | 48.604866 | -2.825003 | 1456 |
| ... | ... | ... | ... |
| 22068176 | 29.806055 | -98.41853 | 5532 |
There are 18,000 classes, but some of them are over-represented and some are under-represented. Note that each observation is either in France or in the USA.
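For reference, a minimal DataFrame with the same schema can be built from the sample rows above:

```python
import pandas as pd

# minimal reproducible sample with the same schema as the real data
df = pd.DataFrame({
    'observation_id': [10131188, 10799362, 10392536, 22068176],
    'latitude': [45.146973, 46.783695, 48.604866, 29.806055],
    'longitude': [6.416794, -2.072855, -2.825003, -98.41853],
    'class_id': [101, 700, 1456, 5532],
})
```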
I need to find, for each observation, the distance to the closest observation of every class.
For example, for the first observation (which belongs to class 101, per the table above), I will have a vector of size 18,000. The first value of the vector will represent the distance in km to the closest occurrence of class 1, the second value the distance in km to the closest occurrence of class 2, and so on, until the last value, which will represent, you guessed it, the distance in km to the closest occurrence of class 18,000.
If the distance is too large (let's say more than 50 km), I don't need the exact distance, just a fixed value (50 km in this case). So if the closest occurrence of a class is more than 50 km from my observation (whether that's 51 km or 9,000 km), I can fill in 50 for the corresponding entry of the observation's vector.
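In NumPy terms, the vector I want for one observation would look like this (`nearest_km` is just a random stand-in for the true nearest-occurrence distances):

```python
import numpy as np

rng = np.random.default_rng(0)
nearest_km = rng.uniform(0, 9000, size=18000)  # stand-in for the true per-class distances
vector = np.minimum(nearest_km, 50.0)          # shape (18000,), everything beyond 50 km becomes 50
```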
But I see two problems here:
- My code will take forever to run.
- The created file will be huge.
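To put a number on the second problem: 2,000,000 × 18,000 values is 3.6 × 10¹⁰ entries, i.e. roughly 288 GB in float64 (or 144 GB in float32) before any compression.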
I started writing a small script that computes the haversine distance, but it takes around 8 seconds for a single observation, so it would be impossible to run for all 2 million. Here it is anyway:
```python
from math import radians, sin, cos, asin, sqrt

import numpy as np

lat1 = radians(45.705116)  # lat for observation 10561949
lon1 = radians(1.424622)   # lon for observation 10561949
df = df[df.observation_id != 10561949]  # remove observation 10561949 from the DataFrame

list_obs = np.full(18000, 50.0)  # array of size 18,000 filled with the 50 km cap
for class_id, lat2, lon2 in zip(df['class_id'], df['latitude'], df['longitude']):
    lat2, lon2 = radians(lat2), radians(lon2)  # convert to radians
    a = sin((lat2 - lat1) / 2)**2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2)**2
    dist = 2 * asin(sqrt(a)) * 6371  # haversine distance in km
    if dist < list_obs[class_id - 1]:  # class ids are 1-based
        list_obs[class_id - 1] = dist
```
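For comparison, the same per-observation computation can be vectorized with NumPy. This is only a sketch (it assumes `df` is the DataFrame above and that class ids run from 1 to 18,000), and while it is much faster than the Python loop, it still scans every row for every query:

```python
import numpy as np

def nearest_per_class(lat_deg, lon_deg, df, n_classes=18000, cap_km=50.0):
    """Capped distance in km from one point to the nearest row of every class."""
    lat1, lon1 = np.radians(lat_deg), np.radians(lon_deg)
    lat2 = np.radians(df['latitude'].to_numpy())
    lon2 = np.radians(df['longitude'].to_numpy())
    a = np.sin((lat2 - lat1) / 2)**2 + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2)**2
    dist = 2 * 6371 * np.arcsin(np.sqrt(a))  # haversine over all rows at once, in km
    out = np.full(n_classes, cap_km)
    # unbuffered per-class minimum; entries start at the cap and can only go down
    np.minimum.at(out, df['class_id'].to_numpy() - 1, dist)
    return out
```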
Do you have an idea of how to speed up the algorithm (the distance doesn't have to be perfectly accurate, I just need a rough idea of each observation's nearest neighbor in every class) and how to store the gigantic file afterwards (it will be an array-like of 2,000,000 × 18,000)?
The idea after this is to try to feed the result to a neural network (say an MLP), to see the difference with a simple K-Nearest Neighbors approach.
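For the storage part, the kind of thing I have in mind (a sketch, assuming quantization is acceptable, given that the distances are approximate and capped at 50 km anyway) is one byte per value in a memory-mapped array, which brings the file down to 2,000,000 × 18,000 bytes = 36 GB:

```python
import numpy as np

n_obs, n_classes = 2_000_000, 18_000

# one byte per value: 0..255 maps linearly onto 0..50 km (~0.2 km resolution)
out = np.lib.format.open_memmap('distances.npy', mode='w+',
                                dtype=np.uint8, shape=(n_obs, n_classes))

# hypothetical per-observation write, where dists is the float vector in km:
# out[i] = np.round(dists / 50.0 * 255).astype(np.uint8)
```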