
Good morning. I have a database with almost 1.3 million rows (a lunar craters DB) and I want to cluster the craters that lie inside bigger craters. To do that, I sorted the DB from biggest to smallest diameter and then, for each big crater, iterate over the remaining ones and compute the distance between the positions to check whether they fall inside its diameter. The problem is that this calculation takes about 50 seconds per crater, so processing the whole DB would take months. I tried some alternative techniques like Dask and multiprocessing, but they didn't help. I wonder if anyone could help me.

from datetime import datetime

# craters_diam is sorted by diameter, descending; CLUSTER starts at 0 everywhere
cluster = 1
start = datetime.now()
for i in range(len(craters_diam)):
    start2 = datetime.now()
    if craters_diam.loc[i, 'CLUSTER'] == 0:
        craters_diam.loc[i, 'CLUSTER'] = cluster
        lat1 = craters_diam.loc[i, 'LAT_CIRC_IMG']
        lon1 = craters_diam.loc[i, 'LON_CIRC_IMG']
        diam = craters_diam.loc[i, 'DIAM_CIRC_IMG']
        # Assign every smaller, still-unclustered crater whose centre lies
        # within this crater's radius to the current cluster
        for j in range(i + 1, len(craters_diam)):
            if craters_diam.loc[j, 'CLUSTER'] == 0:
                lat2 = craters_diam.loc[j, 'LAT_CIRC_IMG']
                lon2 = craters_diam.loc[j, 'LON_CIRC_IMG']
                dist = distance(lat1, lat2, lon1, lon2)
                if dist <= diam / 2:
                    craters_diam.loc[j, 'CLUSTER'] = cluster
        cluster += 1
    print(datetime.now() - start2)
print(datetime.now() - start)

The distance function computes great-circle distances in spherical geometry.
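For reference, a haversine-style sketch of such a function (illustrative only; the MOON_RADIUS_KM constant is an assumption and my actual implementation may differ):

import numpy as np

MOON_RADIUS_KM = 1737.4  # assumed lunar radius; adjust to your units

def distance(lat1, lat2, lon1, lon2):
    # Haversine great-circle distance between two points given in degrees
    lat1, lat2, lon1, lon2 = map(np.radians, (lat1, lat2, lon1, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * MOON_RADIUS_KM * np.arcsin(np.sqrt(a))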

If anyone knows a clever (faster) way to do that, thank you!!!


1 Answer


The computation is slow because the complexity of your algorithm is quadratic: O(n²), where n is the number of rows whose CLUSTER column is still set to 0.

First of all, there are many faster clustering algorithms. Your algorithm looks like a simplified DBSCAN. A famous one is k-Means, which assumes you approximately know the number of clusters (not the case here). An alternative when you do not know the number of clusters is Mean-Shift clustering, although I am not sure it will work on your specific dataset.
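As a rough illustration, here is a minimal DBSCAN sketch using scikit-learn's haversine metric. Be aware of the assumptions: it uses a single fixed search radius eps_km, whereas your rule uses a per-crater radius of diam/2, so treat it as a starting point rather than a drop-in replacement:

import numpy as np
from sklearn.cluster import DBSCAN

MOON_RADIUS_KM = 1737.4  # assumed lunar radius in km

# DBSCAN's haversine metric expects [lat, lon] in radians and eps as an
# angular distance (radians), hence the division by the sphere radius.
coords = np.radians(craters_diam[['LAT_CIRC_IMG', 'LON_CIRC_IMG']].to_numpy())
eps_km = 50.0  # hypothetical fixed search radius in km
db = DBSCAN(eps=eps_km / MOON_RADIUS_KM, min_samples=2, metric='haversine')
craters_diam['CLUSTER'] = db.fit_predict(coords)  # -1 marks unclustered points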

To efficiently fetch the neighbouring points close to a target, you can use a k-d tree. In 2D, you can also use a quad-tree, which is easier to implement and generally significantly faster. Such a structure reduces the complexity of your algorithm from O(n²) to O(n log n): add all the points to the tree, then for each point query the close ones. I expect this to be 3~4 orders of magnitude faster in your case. A simple way to do that in Python is to use the SciPy implementation of a k-d tree. The SciPy implementation is not very fast, but it should be enough to make your algorithm drastically faster (although it is a bit complex to use). A faster way would be to implement this in a native language and perform the computation in parallel using multiple threads.
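To give an idea, here is a sketch of your greedy pass rewritten with SciPy's cKDTree. Since k-d trees work in Euclidean space, the points are first projected to 3D Cartesian coordinates on the sphere, and the great-circle radius diam/2 is converted to the equivalent chord length. It assumes DIAM_CIRC_IMG is in the same kilometre units as the (assumed) lunar radius:

import numpy as np
from scipy.spatial import cKDTree

MOON_RADIUS_KM = 1737.4  # assumed lunar radius in km

# Project lat/lon (degrees) to 3D Cartesian points on the sphere so the
# k-d tree can answer Euclidean (chord) distance queries.
lat = np.radians(craters_diam['LAT_CIRC_IMG'].to_numpy())
lon = np.radians(craters_diam['LON_CIRC_IMG'].to_numpy())
xyz = MOON_RADIUS_KM * np.column_stack((np.cos(lat) * np.cos(lon),
                                        np.cos(lat) * np.sin(lon),
                                        np.sin(lat)))
tree = cKDTree(xyz)

diam = craters_diam['DIAM_CIRC_IMG'].to_numpy()
clusters = np.zeros(len(craters_diam), dtype=np.int64)
cluster = 1
for i in range(len(craters_diam)):  # rows sorted from biggest to smallest
    if clusters[i] == 0:
        # Chord length equivalent to a great-circle distance of diam/2
        chord = 2 * MOON_RADIUS_KM * np.sin(diam[i] / (4 * MOON_RADIUS_KM))
        for j in tree.query_ball_point(xyz[i], chord):
            if clusters[j] == 0:  # crater i itself plus unassigned neighbours
                clusters[j] = cluster
        cluster += 1
craters_diam['CLUSTER'] = clusters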

Note that iterating over a Pandas dataframe is generally known to be very slow, and you should use vectorized functions as much as possible instead. When that is not possible, you can drop down to NumPy, or write your own function with Numba or Cython.
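For example, a sketch of what vectorizing your inner loop could look like: one NumPy haversine call computes the distance from crater i to every crater at once (again assuming the lunar radius constant):

import numpy as np

MOON_RADIUS_KM = 1737.4  # assumed lunar radius in km

lat = np.radians(craters_diam['LAT_CIRC_IMG'].to_numpy())
lon = np.radians(craters_diam['LON_CIRC_IMG'].to_numpy())

def dist_to_all(i):
    # Haversine distance from point i to all points, in one vectorized call
    a = (np.sin((lat - lat[i]) / 2) ** 2
         + np.cos(lat[i]) * np.cos(lat) * np.sin((lon - lon[i]) / 2) ** 2)
    return 2 * MOON_RADIUS_KM * np.arcsin(np.sqrt(a))

# In the outer loop, this single call replaces the whole inner Python loop:
# within = dist_to_all(i) <= craters_diam.loc[i, 'DIAM_CIRC_IMG'] / 2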

Jérôme Richard