I'm trying to use clustering to identify bands in my data set. I'm working with supply chain data, so my data looks like this:
The relevant column is the Price per Each.
The problem is that we sometimes incorrectly record a product as coming in a Case of 100 instead of a Case of 10, so the Price per Each values for one product would look like (2, 0.25, 3). I want to write code that only creates an additional cluster if its mean price is at least twice as large, or at most half as large, as the mean of every existing cluster.
For example, if my prices per each were (4, 5, 6, 13, 14, 15), I want it to return 2 clusters with centroids of 5 and 14. If, on the other hand, my data looked like (3, 4, 5, 6), it should return one cluster.
The goal is to write code that returns the product codes of items for which more than one cluster was generated, so that I can audit those product codes for bad units of measure (Case of 100 vs. Case of 10).
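To pin the rule down, here is a rough sketch of the logic. It is written in Python purely for illustration, not R, and it is not divisive hierarchical clustering proper: it just sorts the prices for one product and greedily cuts at the widest relative gaps, keeping a cut only if the means of neighbouring bands still differ by the required factor. The names `price_bands` and `flag_products`, the `min_ratio` parameter, and the `(product_code, price)` row layout are all made up for the example, and it assumes every price is positive.

```python
from statistics import mean


def price_bands(prices, min_ratio=2.0):
    """Split prices into bands whose means differ by at least min_ratio."""
    values = sorted(prices)
    if not values or values[0] <= 0:
        raise ValueError("prices must be non-empty and positive")

    def bands_for(cuts):
        # Turn a sorted list of cut positions into contiguous slices.
        edges = [0] + cuts + [len(values)]
        return [values[a:b] for a, b in zip(edges, edges[1:])]

    # Candidate cut positions, widest relative gap between neighbours first.
    candidates = sorted(range(1, len(values)),
                        key=lambda i: values[i] / values[i - 1],
                        reverse=True)
    cuts = []
    for cut in candidates:
        trial = sorted(cuts + [cut])
        means = [mean(band) for band in bands_for(trial)]
        # Bands are in ascending order, so if every adjacent pair of means
        # differs by min_ratio, every other pair does as well.
        if all(hi / lo >= min_ratio for lo, hi in zip(means, means[1:])):
            cuts = trial
    return bands_for(cuts)


def flag_products(rows, min_ratio=2.0):
    """Return product codes whose prices fall into more than one band.

    rows is an iterable of (product_code, price_per_each) pairs
    (a hypothetical layout chosen for this sketch).
    """
    by_code = {}
    for code, price in rows:
        by_code.setdefault(code, []).append(price)
    return sorted(code for code, prices in by_code.items()
                  if len(price_bands(prices, min_ratio)) > 1)
```

With the prices from the examples above, `price_bands([4, 5, 6, 13, 14, 15])` gives two bands with means 5 and 14, `price_bands([3, 4, 5, 6])` gives a single band, and `price_bands([2, 0.25, 3])` separates 0.25 from the other two prices, so that product code would be flagged for audit.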
I'm thinking about using divisive hierarchical clustering, but I don't know how to introduce the centroid distance rule for creating new clusters.
I'm fairly new to R, but I have SQL and Stata experience, so I'm looking for a package that would do this or help with the syntax I need to accomplish this.