
I have a series of line data (2-3 connected points each). What is the best machine learning algorithm I can use to classify lines by their location similarity? (image below)

Preferably with Python libraries such as scikit-learn.

CLICK HERE TO SEE THE IMAGE

Edit: I have tried DBSCAN, but the problem I faced was that when two lines intersect each other, DBSCAN sometimes considers them one group even though they run in completely different directions.

Here is a solution I have found so far:

GeoPath Clustering Algorithm

The idea here is to cluster geo paths that travel very similarly to each other into groups.

Steps:

1- Cluster lines by slope.

2- Within each cluster from step 1, find the centroid of each line and cluster those centroids into smaller groups with the k-means algorithm.

3- Within each group from step 2, calculate the length of each line and group lines that fall within a defined length threshold.

The result will be small groups of lines that have similar slope, are close to each other, and have similar travel distance.
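The three steps above can be sketched in Python. The helper names, the number of slope bins, and the thresholds below are my own illustrative choices, not a tested implementation of the algorithm:

```python
import numpy as np
from sklearn.cluster import KMeans

def line_features(line):
    """line: (k, 2) array of 2-3 connected points."""
    p0, p1 = line[0], line[-1]
    # Direction of the end-to-end chord, folded to [0, pi) so that
    # opposite orientations of the same slope land in the same bucket.
    angle = np.arctan2(p1[1] - p0[1], p1[0] - p0[0]) % np.pi
    centroid = line.mean(axis=0)
    length = np.sum(np.linalg.norm(np.diff(line, axis=0), axis=1))
    return angle, centroid, length

def cluster_paths(lines, angle_bins=8, k=3, length_tol=0.5):
    feats = [line_features(np.asarray(l, dtype=float)) for l in lines]
    # Step 1: coarse slope buckets.
    slope_groups = {}
    for i, (angle, _, _) in enumerate(feats):
        slope_groups.setdefault(int(angle / np.pi * angle_bins), []).append(i)
    labels, next_label = {}, 0
    for idxs in slope_groups.values():
        # Step 2: k-means on the line centroids within each slope bucket.
        cents = np.array([feats[i][1] for i in idxs])
        km = KMeans(n_clusters=min(k, len(idxs)), n_init=10).fit(cents)
        for sub in range(km.n_clusters):
            members = [idxs[j] for j in range(len(idxs)) if km.labels_[j] == sub]
            # Step 3: split each spatial group into runs of similar length.
            members.sort(key=lambda i: feats[i][2])
            start = 0
            for j in range(1, len(members) + 1):
                if j == len(members) or feats[members[j]][2] - feats[members[start]][2] > length_tol:
                    for m in members[start:j]:
                        labels[m] = next_label
                    next_label += 1
                    start = j
    return [labels[i] for i in range(len(lines))]
```

Lines end up in the same group only if they pass all three filters: similar slope, nearby centroid, and similar length.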

Here are screenshots of the visualization: yellow lines are all the lines, and red lines are clusters of paths that travel together.


user2146024
  • This is actually called clustering. But for such simple data you should be able to write down a simple heuristic to do it. – lejlot Jul 29 '16 at 21:16

2 Answers


I'll throw in an answer, since I think the current one is incomplete... and I also think the "simple heuristic" comment is premature. If you cluster on the raw points, you'll get a different result from what your diagram depicts: the clusters will form near the endpoints, and you won't get your nice ellipses.

So, if your data really does behave the way you display it, I would take a stab at turning each set of 2-3 points into a longer list of points that traces out the line. (You will need to experiment with how dense to make it.)
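As a sketch of that densification step, here is one way to resample a 2-3 point polyline by linear interpolation (the per-segment point count is an arbitrary value to tune):

```python
import numpy as np

def densify(line, points_per_segment=20):
    """Resample a polyline of 2-3 points into many evenly spaced points."""
    line = np.asarray(line, dtype=float)
    pieces = []
    for a, b in zip(line[:-1], line[1:]):
        # endpoint=False avoids duplicating the shared vertex between segments.
        t = np.linspace(0.0, 1.0, points_per_segment, endpoint=False)
        pieces.append(a + t[:, None] * (b - a))
    pieces.append(line[-1:])  # keep the final endpoint
    return np.vstack(pieces)
```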

Then run HDBSCAN on the result (see this video: https://www.youtube.com/watch?v=AgPQ76RIi6A) to get your clusters. I believe "pip install hdbscan" installs it.

Now, when testing a new sample, first decompose it into many (N) points and run them through your HDBSCAN model. I reckon that if you take a majority vote over those N points, you'll get the best overall cluster for the "line".

So, while I somewhat agree with the "simple heuristic" comment, it's not so simple if you want the whole thing automated. And once you watch the video, you may be convinced that HDBSCAN's density-based algorithm suits this problem (if you decide to create many points from each sample).

I'll wrap up by saying that I'm sure there are line-intersection models that have done this before... and that heuristics and rules exist that can do the job, likely more economically in compute. My answer is just something organic using sklearn, as you requested... and I haven't even tested it! It's just how I would proceed if I were in your shoes.

edit

I poked around, and there are a couple of line-similarity measures you could try: the Fréchet and Hausdorff distances.

Fréchet: http://arxiv.org/pdf/1307.6628.pdf. Hausdorff: see "distance matrix of curves in python" for a Python example.

If you generate all pairwise similarities and then group them by similarity and/or into N bins, you can call those bins your "clusters" (not k-means clusters, though!). For each new line, generate all similarities and see which bin it belongs to. I revise my earlier remark about this possibly being computationally less intensive... you're lucky your lines have only 2 or 3 points!

user1269942

The problem you're trying to solve is called clustering. For an overview of clustering algorithms in sklearn, see http://scikit-learn.org/stable/modules/clustering.html#clustering.

Edit 2: KMeans was what sprang to mind when I first saw your post, but based on feedback in the comments it looks like it's not a good fit. You may want to try sklearn's DBSCAN instead.

A potential transformation or extra feature you could add would be to fit a straight line to each set of points, and then use the (slope, intercept) pair. You may also want to use the centroid of each line.
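A minimal sketch of that feature transformation, assuming roughly non-vertical lines (np.polyfit fits y as a function of x) and made-up DBSCAN parameters:

```python
import numpy as np
from sklearn.cluster import DBSCAN

def line_to_features(line):
    """Fit y = m*x + b through the line's points and take the centroid.
    Caveat: np.polyfit degrades for near-vertical lines (all x nearly equal)."""
    line = np.asarray(line, dtype=float)
    m, b = np.polyfit(line[:, 0], line[:, 1], 1)
    cx, cy = line.mean(axis=0)
    return [m, b, cx, cy]

def cluster_lines(lines, eps=0.5, min_samples=2):
    X = np.array([line_to_features(l) for l in lines])
    return DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X)
```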

bpachev
  • kmeans is not really designed to handle such non-Euclidean spaces, where your metric would probably be something like the volume of the convex hull of two lines. However, given such a metric, even simple DBSCAN should work just fine (or any other clustering that accepts an arbitrary metric) – lejlot Jul 29 '16 at 21:40
  • Thanks for your feedback. I have updated my answer to reflect it. KMeans might work on the OP's posted dataset, though. – bpachev Jul 29 '16 at 21:57
  • Kmeans is not well defined without a Euclidean metric – it simply will not work (its behaviour is ill-defined under other metrics, since taking an average for the centroids does not have to reduce the cost), so the method might not even converge. His data is not even an R^n space (each line is a variable-length point sequence). He would have to fix the number of points and then use the Euclidean distance; k-means would converge, but such clustering is meaningless (the Euclidean distance between endpoints does not capture the "lineness" distance) – lejlot Jul 29 '16 at 22:13
  • His lines are defined as a sequence of two or three points. He could input those individual points into KMeans, and it would work for his posted dataset. The potential problem would be if points belonging to the same line are put in different clusters. – bpachev Jul 29 '16 at 22:17
  • clustering endpoints does not make any sense (even in the example provided, you would end up with parts of lines in different clusters; and this would be a common situation, as any long line will get split) – lejlot Jul 29 '16 at 22:28
  • Thanks again for your feedback. I've updated my post accordingly. – bpachev Jul 29 '16 at 22:43