How do I determine the distance / eps for DBSCAN in R?

Question

I have a dataset of points:

 lat    long     time
 34.53  -126.34  1
 34.52  -126.32  2
 34.51  -126.31  3
 34.54  -126.36  4
 34.59  -126.28  5
 34.63  -126.14  6
 34.70  -126.05  7
 ...

(Much larger dataset, but this is the general structure.)

I want to cluster points based on distance and time. DBSCAN seems like a good choice, since I don't know how many clusters there are.

I am currently using minute/5500 (which I believe is approximately 20 meters, scaled).

library(fpc)
results <- dbscan(data, MinPts = 3, eps = 0.00045, method = "raw", scale = FALSE, showplot = 1)

I am having a problem understanding how the scaling / distance is determined, since I have raw data. I can guess at values for eps when scaled or unscaled, but I am unclear what the scaling does, or what distance metric is being used (Euclidean distance, perhaps?) Is there documentation on this somewhere?

(This is not about finding an automated way to choose eps, as in "Choosing eps and minpts for DBSCAN (R)?", but about what the different values mean. Saying "You need a distance function first" doesn't explain what distance function is being used, or how to create one.)

From which package is `dbscan`? Is it `fpc` or `RWeka` or something else? — mnel, Feb 21 '13 at 01:30
I see this as being somewhat different to the question marked as a duplicate. I'm not sure it is a programming question or a stats question but it is different to the duplicate. — Gavin Simpson, Feb 26 '13 at 19:08

score 1 · Answer 1 · answered Feb 21 '13 at 06:39

I use ELKI rather than R/fpc, so I can't really answer your question. I use ELKI because I have found it to be substantially faster than fpc, in particular when you can use indexes. When you work with data sets of millions of points, the difference is huge.

Furthermore, it's very flexible, and that seems to be what you need:

ELKI does have a LatLng distance function that uses the great circle distance. Then I can set epsilon easily in kilometres.

However, you also have a time attribute. Do you have any plans on including this in your analysis yet? ELKI has a tutorial on writing custom distance functions, which is probably what you need then. You should be able to reuse the great circle distance, and here is a neat trick with DBSCAN for you:

DBSCAN doesn't really need the distances. It needs to know the neighbors, but the distances are only used for comparison to epsilon. So by defining a distance function that is 0 when two objects should be similar and 1 when they should be different, along with an epsilon of 0.5, you can do much more complex clusterings. In your context, you could define your distance function as:

0 if the distance is less than 0.1 km and the time difference is also less than t
1 otherwise
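
Since the question asks about R, here is a minimal sketch of the same trick there, assuming fpc's dbscan is given a precomputed matrix via method = "dist". The helper haversine_km, the thresholds max_km and max_dt, and the value chosen for t are hypothetical and only for illustration. The sketch builds a full n x n matrix, so it is only practical for modest n.

```r
# Sample rows from the question; time is in whatever units the data uses.
pts <- data.frame(
  lat  = c(34.53, 34.52, 34.51, 34.54, 34.59, 34.63, 34.70),
  long = c(-126.34, -126.32, -126.31, -126.36, -126.28, -126.14, -126.05),
  time = 1:7
)

# Great-circle (haversine) distance in km between all pairs of points.
# Hypothetical helper, assuming a spherical Earth of radius 6371 km.
haversine_km <- function(lat, long, radius_km = 6371) {
  phi     <- lat * pi / 180
  lambda  <- long * pi / 180
  dphi    <- outer(phi, phi, "-")
  dlambda <- outer(lambda, lambda, "-")
  a <- sin(dphi / 2)^2 + outer(cos(phi), cos(phi)) * sin(dlambda / 2)^2
  2 * radius_km * asin(pmin(sqrt(a), 1))
}

max_km <- 0.1  # spatial threshold from the answer above
max_dt <- 2    # assumed value for the time threshold t

space_km <- haversine_km(pts$lat, pts$long)
time_gap <- abs(outer(pts$time, pts$time, "-"))

# 0 if a pair is close in both space and time, 1 otherwise.
d01 <- ifelse(space_km < max_km & time_gap < max_dt, 0, 1)

# Run DBSCAN on the 0/1 matrix only if fpc is installed.
if (requireNamespace("fpc", quietly = TRUE)) {
  res <- fpc::dbscan(d01, eps = 0.5, MinPts = 3, method = "dist")
  print(res$cluster)  # 0 marks noise points
}
```

The seven sample rows are all more than 0.1 km apart, so every off-diagonal entry of d01 is 1 here; on real data the thresholds would need to match the scale of the clusters you expect.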

Thanks for the help, but is there anything on how to do this in R? Also, as noted in the original post, I am using time. — David Manheim, Feb 26 '13 at 17:52
No, I don't use R. Most likely it defaults to Euclidean distance, I don't know if it also allows you to use other distances. Oh, and minpts=3 is likely too small. Use larger values. — Has QUIT--Anony-Mousse, Mar 01 '13 at 23:46

score 1 · Accepted Answer · answered Sep 04 '13 at 14:34

First, calculate the distance matrix of your data. Then, instead of method='raw', use method='dist'. dbscan will then treat your data as a distance matrix, so you do not need to worry about how the distance function is implemented. Note that this may require more memory, since the precomputed distance matrix has to be stored in memory.
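
A minimal sketch of this approach, assuming the Euclidean distance that base R's dist() computes by default, applied to the lat/long columns only. The eps value of 0.03 is an arbitrary choice for the sample rows, and it is expressed in degrees because the matrix entries are in degrees.

```r
# Sample rows from the question.
pts <- data.frame(
  lat  = c(34.53, 34.52, 34.51, 34.54, 34.59, 34.63, 34.70),
  long = c(-126.34, -126.32, -126.31, -126.36, -126.28, -126.14, -126.05)
)

# dist() defaults to Euclidean distance; as.matrix() expands it to n x n.
d <- as.matrix(dist(pts[, c("lat", "long")]))

# eps is compared directly against the matrix entries (degrees here).
if (requireNamespace("fpc", quietly = TRUE)) {
  res <- fpc::dbscan(d, eps = 0.03, MinPts = 3, method = "dist")
  print(res$cluster)  # 0 marks noise points
}
```

Because you build the matrix yourself, any other metric (great-circle distance, for example) can be swapped in for dist(), and eps then takes that metric's units. The cost is the n x n matrix held in memory.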

fatih

This does end up being a memory problem, but is helpful to understand. – David Manheim Oct 16 '13 at 20:37
