
I am trying to write a function in R that automatically chooses the optimal parameters epsilon and MinPts for a DBSCAN analysis. I found the k-nearest neighbour distance plot very useful for selecting the optimal eps. However, I want to make the whole process automatic, and I was wondering whether there is a method to locate the exact position of the knee, so as to get the most representative eps possible.

I tried computing the slope at each point of a kNN distance plot and taking the maximum value (since the knee represents the point of maximum curvature), but it wasn't very useful. Is it actually possible to find the knee automatically, or should one just inspect the plot visually?
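A minimal sketch of one way to locate a knee programmatically, not taken from the original attempt: instead of the maximum slope, it picks the point of the sorted kNN distance curve that lies farthest from the straight line joining the curve's first and last points, after rescaling both axes to [0, 1]. The helper names `knn_dist_sorted` and `find_knee`, the toy data, and `k = 4` are all assumptions made for illustration. The base R `dist()` call is quadratic in the number of points, so on real data something like `dbscan::kNNdist()` would probably be a better source for the distances.

```r
# Sorted distance from each point to its k-th nearest neighbour (base R only).
knn_dist_sorted <- function(x, k) {
  d <- as.matrix(dist(x))
  # Entry 1 of each sorted row is the point itself (distance 0), so take k + 1.
  kth <- apply(d, 1, function(row) sort(row)[k + 1])
  sort(unname(kth))
}

# Knee heuristic (hypothetical helper): index of the point farthest below the
# chord joining the first and last points, with both axes rescaled to [0, 1].
# Assumes an increasing, roughly convex curve whose endpoints differ.
find_knee <- function(y) {
  n <- length(y)
  x <- seq(0, 1, length.out = n)
  y_scaled <- (y - y[1]) / (y[n] - y[1])
  gap <- x - y_scaled  # proportional to the distance from the diagonal
  which.max(gap)
}

set.seed(42)
# Hypothetical toy data: two tight clusters plus uniform background noise.
pts <- rbind(
  matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(200, mean = 4, sd = 0.3), ncol = 2),
  matrix(runif(40, min = -2, max = 6), ncol = 2)
)

k <- 4
kd <- knn_dist_sorted(pts, k)
knee_idx <- find_knee(kd)
eps_candidate <- kd[knee_idx]  # candidate eps, to be checked against the plot
```

The result is only a candidate: plotting `kd` and marking `eps_candidate` with `abline(h = eps_candidate)` is a quick way to check whether the detected knee matches what the eye would pick.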

Thanks in advance.

camille
Swann
  • This might be better suited to [stats.se] if your focus is on the *method* behind finding optimal k more than already having an algorithm in mind and needing to write the code. That said, there are a couple packages that bootstrap this; the one I've used most is factoextra. To get any more than suggestions, though, you'll need to post a [reproducible example](https://stackoverflow.com/q/5963269/5325862) – camille Dec 28 '21 at 16:44
  • Agreed, come back here if you're having trouble implementing code once you decide on a particular approach. It's not a straightforward question. However, I'll add the [NbClust package](https://cran.r-project.org/web/packages/NbClust/index.html) to simultaneously compare many methods of finding the optimal value of `k` – Dan Adams Dec 28 '21 at 17:03
  • Your definition of "knee" would only apply to a sigmoid curve. Applying it in this case always picks 1 or 2 clusters depending on which end you choose. – dcarlson Dec 28 '21 at 17:04
  • I see, I will definitely post on Cross Validated since I am not really sure if my algorithm is going anywhere (I'm a beginner). With factoextra, have you been able to locate the knee precisely? If so, I would be glad to know which function you used. I was also trying to find an R function that locates the maximum gradient of a specific line plot (like the kNNdistplot() one), but in vain. – Swann Dec 28 '21 at 17:14
  • I don't know that the process of deciding on the best k is going to be fully automated, nor should it be necessarily—that's how you end up with black box models that you can't explain and that don't necessarily actually represent the real world or the context and nuance you need. Functions in factoextra and NbClust will test values based on different metrics and report back to you, but those metrics might not agree with each other or be relevant to your context. Ultimately the decision on how to proceed is up to you – camille Dec 28 '21 at 17:57
  • An example from my own work was once trying to cluster neighborhoods based on several socioeconomic factors. Some of the metrics I tested said the "best" k was something like either 2 or 11. Two clusters would tell me very little about the types of neighborhoods, and 11 would be unintelligible to the general public audience I have. So I had to pick the best out of a reasonable range, maybe 3–6. That's just general advice; right now this question is too broad for SO – camille Dec 28 '21 at 18:03
  • Right, I see what you mean. I'm going to dig further into this subject. Thank you for your explanations. – Swann Dec 28 '21 at 19:05

0 Answers