I am trying to apply a clustering algorithm in R, and I have read a basic introduction to using dbscan in R. My data consists of start/finish locations and times (more than 50k rows).

This is what a sample looks like:

# A tibble: 10 x 6
   start_location_Long start_location_Lat end_location_Long end_location_Lat start_time1_cos end_time1_cos
                 <dbl>              <dbl>             <dbl>            <dbl>           <dbl>         <dbl>
 1                101.               13.9              101.             13.9          -0.978        -0.998
 2                101.               13.9              101.             13.8          -0.465         0.503
 3                101.               13.9              101.             13.9          -0.756        -0.982
 4                101.               13.8              101.             13.8          -0.827        -0.773
 5                101.               13.8              101.             13.8          -0.956        -0.949
 6                101.               13.8              101.             13.8          -0.969        -0.961
 7                101.               13.8              101.             13.8          -0.946        -0.521
 8                101.               13.8              101.             13.7          -0.972        -0.910
 9                101.               13.7              101.             13.7          -0.840        -0.837
10                101.               13.8              101.             13.7          -0.497        -0.313

data <- structure(list(
  start_location_Long = c(100.60066, 100.60039, 100.56864, 100.59018, 100.55926,
                          100.61014, 100.61504, 100.75646, 100.56093, 100.52679),
  start_location_Lat = c(13.91761, 13.91746, 13.88542, 13.7969, 13.83207,
                         13.82256, 13.80237, 13.82296, 13.73084, 13.76592),
  end_location_Long = c(100.59982, 100.53864, 100.57354, 100.59309, 100.56502,
                        100.56652, 100.65582, 100.73325, 100.56094, 100.53465),
  end_location_Lat = c(13.91616, 13.8288, 13.86449, 13.84172, 13.82841,
                       13.82762, 13.82176, 13.72228, 13.73224, 13.74595),
  start_time1_cos = c(-0.977783236758606, -0.464584475495966, -0.756281834105734,
                      -0.827489114105152, -0.955963918764982, -0.968565073328525,
                      -0.946485086708269, -0.971772589428584, -0.839856789165117,
                      -0.497478722371776),
  end_time1_cos = c(-0.998416312411851, 0.502642787734849, -0.98199994355324,
                    -0.772641247513493, -0.949334100771872, -0.960940326679488,
                    -0.521319957219796, -0.910443172287846, -0.837480354951308,
                    -0.313301931309727)),
  row.names = c(NA, -10L), class = c("tbl_df", "tbl", "data.frame"))

Following the post "Choosing eps and minpts for DBSCAN (R)?", I scaled my data, set minPts to 4, and chose eps from the k-nearest-neighbor (kNN) distances.
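For readers who want to reproduce the workflow described above, here is a minimal sketch using the dbscan package. The `trips` data frame is a hypothetical synthetic stand-in for the real 50k-row data set (only the column names come from the question), and taking the 90th percentile of the kNN distances is just a crude assumption standing in for reading the knee off the plot by eye.

```r
library(dbscan)

set.seed(42)
n <- 500
# Hypothetical stand-in for the real trip data: one dense cloud in 6 dimensions
trips <- data.frame(
  start_location_Long = rnorm(n, 100.58, 0.05),
  start_location_Lat  = rnorm(n, 13.82, 0.05),
  end_location_Long   = rnorm(n, 100.58, 0.05),
  end_location_Lat    = rnorm(n, 13.80, 0.05),
  start_time1_cos     = runif(n, -1, 1),
  end_time1_cos       = runif(n, -1, 1)
)

# Centre each column and scale it to unit variance so no variable dominates the distance
scaled <- scale(trips)
min_pts <- 4

# Plot the sorted distance to the k-th nearest neighbour; the knee suggests eps
kNNdistplot(scaled, k = min_pts - 1)

# Assumption: use a high quantile of the kNN distances instead of a visually chosen knee
knn_d <- sort(kNNdist(scaled, k = min_pts - 1))
eps_guess <- unname(quantile(knn_d, 0.90))

fit <- dbscan(scaled, eps = eps_guess, minPts = min_pts)
table(fit$cluster)  # cluster 0 holds the noise points
```

The table of cluster labels is the quickest way to see whether almost every point ended up in a single cluster.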

However, the clustering always merges everything into a single group, even though I have tried many different values of minPts and eps.

Could anyone with experience using the dbscan algorithm help me? How should I cluster this data? Because my data set is very large and the small sample may not be enough, I have also provided the raw data here.

Thank you in advance.

Yasumin

1 Answer

Density-based clustering can only separate clusters if there are areas of lower density between them. Your data looks like a single cloud of points (just plot a sample), and this is why DBSCAN does not separate the data into more than one cluster.
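The point about density gaps can be illustrated with a small sketch on synthetic data. Everything here is an assumption chosen for illustration (the two-dimensional toy data, `eps = 0.8`, `minPts = 5`, and the helper `n_clusters`); it is not the asker's data.

```r
library(dbscan)

set.seed(1)

# Case 1: two tight blobs with an empty, low-density gap between them
blobs <- rbind(
  matrix(rnorm(400, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(400, mean = 5, sd = 0.3), ncol = 2)
)

# Case 2: a single cloud of points with no gap anywhere
cloud <- matrix(rnorm(800, mean = 0, sd = 1), ncol = 2)

# Hypothetical helper: count clusters, ignoring label 0 (noise)
n_clusters <- function(fit) length(setdiff(unique(fit$cluster), 0))

fit_blobs <- dbscan(blobs, eps = 0.8, minPts = 5)
fit_cloud <- dbscan(cloud, eps = 0.8, minPts = 5)

n_clusters(fit_blobs)   # the gap lets DBSCAN separate the blobs
table(fit_cloud$cluster) # the cloud typically ends up as one big cluster plus some noise
```

Shrinking `eps` on the single cloud usually just turns more of the fringe into noise rather than revealing meaningful groups, which matches the behaviour described in the question.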

You could use k-means, but it will simply partition the data space into roughly spherical regions of similar size, each holding roughly the same number of points.
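As a rough sketch of what k-means does with a single cloud, the following uses base R's `kmeans` on synthetic data. The toy data, the column names `x` and `y`, and the choice of 3 centers are assumptions for illustration only.

```r
set.seed(7)
n <- 600
# Hypothetical single cloud of points with no natural groups
cloud <- data.frame(x = rnorm(n), y = rnorm(n))
scaled <- scale(cloud)

# nstart > 1 reruns the algorithm from several random starts and keeps the best result
km <- kmeans(scaled, centers = 3, nstart = 10)

km$size     # number of points in each partition
km$centers  # centroid coordinates in scaled units

# Convert the centroids back to the original units of the data
centers_orig <- sweep(km$centers, 2, attr(scaled, "scaled:scale"), "*")
centers_orig <- sweep(centers_orig, 2, attr(scaled, "scaled:center"), "+")
centers_orig
```

k-means always returns the requested number of clusters, so a split into three groups here says nothing about whether three groups actually exist in the data.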

Michael Hahsler
  • Thanks for your answer, @Michael Hahsler. May I ask a question? I can only see the visualization; I can't find the proportions of the contributions of the variables in the DBSCAN output. Is there an option in R to output the variables or their contributions? – Yasumin Jun 01 '21 at 03:53
  • I am not quite sure what you mean by proportions. Partitional clustering only returns class labels. Are you referring to the coordinates of centroids? – Michael Hahsler Jun 02 '21 at 14:47
  • Thanks for answering, @Michael Hahsler. As I understand it, clustering basically looks at the covariances among the variables. So by proportions I mean something like this: start time (a variable) makes the highest contribution, say 30%, and end location the second highest, say 20%. Does partitional clustering produce this kind of result? – Yasumin Jun 05 '21 at 08:33