
I have a CSV file with 140K rows and am working with the pandas library. I have to compare each row with every other row, which takes too much time. At the same time, I am creating another column to which I append data for each row based on the comparison, and this is where I get a memory error.

What is the best way to at least avoid the memory error? I am working on Google Colaboratory with 12 GB of RAM. Dataframe sample:

ID    x_coordinate   y_coordinate
1     2              3
2     3              4
............
X     1              5

I need to find the distance between each row and every other row, and if the distance is within a certain threshold, I assign a new ID to the two rows. So, in my case, if ID 1 and ID 2 are within the threshold, I assign `a` to both. And if ID 2 and ID X are within the threshold, I assign `b` as the new matched ID, as shown below:

ID    x_coordinate   y_coordinate   Matched ID
1     2              3              [a]
2     3              4              [a, b]
............
X     1              5              [b]

For the distance I am using $\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2}$. The threshold can be anything, say m units.

  • Can you show us a sample dataframe and expected results – XXavier Sep 29 '21 at 13:19
  • @XXavier Sorry, it's not possible to share the dataframe. Say I have X rows. I have to compare each row with the other X-1 rows, and that's for all X rows. In my case X is around 140K. I know it's O(n^2). Is there any way to at least handle the memory? – Foysal Khandakar Joy Sep 29 '21 at 13:22
  • You don't have to show us the actual dataframe. A sample dataframe that lets us understand your problem is good enough. In the meantime, can you try `df.columnName.diff()`, where columnName is the name of the column whose differences you want? – XXavier Sep 29 '21 at 13:24
  • How do you define the thresholds on distance? – Cimbali Sep 29 '21 at 13:35
  • Hi thanks both. Please check the updated question. – Foysal Khandakar Joy Sep 29 '21 at 13:39
  • Foysal, if I understand you correctly, you are trying to find the difference in the values of 2 columns and then assign a matched ID based on that difference, correct? And the problem is that your loop takes a long time to execute and you get a memory error while storing the data? If so, I'd suggest you break your dataframe into chunks and work on the chunks instead. Here's a link you can check out - https://stackoverflow.com/questions/33642951/python-using-pandas-structures-with-large-csviterate-and-chunksize – Gary Sep 29 '21 at 13:42
  • Please also add the portion of the code that gives you the memory error (the computing portion, no need to add the whole code) so we can take a look – Gary Sep 29 '21 at 13:46

1 Answer


It sounds like you are trying to hold the complete square distance matrix in memory (n² entries for n rows), which doesn't scale, as you have noticed.
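To put a rough number on that, here is a back-of-the-envelope calculation. It assumes the distances are stored as 8-byte floats (the NumPy/pandas default for `float64`); the helper name `distance_matrix_bytes` is made up for illustration.

```python
def distance_matrix_bytes(n, itemsize=8, symmetric=False):
    """Bytes needed to store all pairwise distances between n points.

    itemsize=8 assumes float64 entries (an assumption, not from the question).
    symmetric=True counts each unordered pair only once: n * (n - 1) / 2 entries.
    """
    entries = n * (n - 1) // 2 if symmetric else n * n
    return entries * itemsize


n_rows = 140_000  # row count given in the question

full = distance_matrix_bytes(n_rows)
half = distance_matrix_bytes(n_rows, symmetric=True)

print(f"full matrix:     {full / 1e9:.1f} GB")  # 156.8 GB
print(f"upper triangle:  {half / 1e9:.1f} GB")  # 78.4 GB
```

Either figure is far beyond 12 GB of RAM, so avoiding the full matrix matters more than shrinking it.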

I'd suggest reading up on how DBSCAN clustering approaches the problem, compared to, e.g., hierarchical clustering:

Instead of computing all the pairwise distances at once, DBSCAN implementations seem to

  • put the data into a spatial database (for efficient neighborhood queries with a threshold) and then
  • iterate over the points to identify the neighbors and the relevant distances on the fly.

Unfortunately, I can't point you to readily available code or pandas functionality that supports this, though.
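That said, the two steps can be illustrated with a minimal pure-Python sketch. It is not DBSCAN itself and not taken from any library: it uses a uniform grid with cells of side m as a stand-in for the spatial index, so each point only has to be compared with points in its own and the eight adjacent cells. The function name `match_within`, the integer labels (0, 1, ... instead of a, b, ...) and the sample points are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import count
from math import floor, hypot


def match_within(points, m):
    """Label every pair of points whose Euclidean distance is <= m.

    points: list of (id, x, y) tuples; m: distance threshold, must be > 0.
    Returns {id: [labels]}, with one new integer label per matching pair.
    """
    # Step 1: "spatial index". Bucket point indices into square cells of
    # side m, so two points within distance m are in the same or adjacent cells.
    grid = defaultdict(list)
    for idx, (_, x, y) in enumerate(points):
        grid[(floor(x / m), floor(y / m))].append(idx)

    # Step 2: visit each cell and compare only against neighbouring cells.
    labels = count()
    matched = {pid: [] for pid, _, _ in points}
    for (cx, cy), members in grid.items():
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for j in grid.get((cx + dx, cy + dy), ()):
                    for i in members:
                        if i >= j:
                            continue  # handle each unordered pair once
                        id_i, xi, yi = points[i]
                        id_j, xj, yj = points[j]
                        if hypot(xi - xj, yi - yj) <= m:
                            label = next(labels)
                            matched[id_i].append(label)
                            matched[id_j].append(label)
    return matched


# Hypothetical sample data: 1-2 and 2-3 are within 1.5 units, 1-3 is not.
sample = [(1, 0.0, 0.0), (2, 1.0, 0.0), (3, 2.0, 0.0), (4, 10.0, 10.0)]
print(match_within(sample, 1.5))
```

With a DataFrame, the input could be built with something like `list(df[["ID", "x_coordinate", "y_coordinate"]].itertuples(index=False, name=None))`, and the result mapped back with `df["ID"].map(result)`. Memory then grows with the number of matching pairs rather than with n², although a very large threshold would still produce close to n² matches. A KD-tree, for example `scipy.spatial.cKDTree` with its `query_pairs` method, should serve the same purpose as the grid and is probably faster in practice.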

moooeeeep