Identifying Data Bands based on Distance between Centroids with Clustering in R

Question

I'm trying to use clustering to identify bands in my data set. I'm working with supply chain data, so my data looks like this:

The relevant column is the price per Each.

The problem is that a product is sometimes incorrectly recorded as coming in a Case of 100 instead of 10, so the Price per Each would look like (2, 0.25, 3). I want to write code that only creates a new cluster if its mean price is at least 2 times greater or smaller than the mean price of every existing cluster.

For example, if my prices per each were (4, 5, 6, 13, 14, 15), I want it to return 2 clusters with centroids of 5 and 14. If, on the other hand, my data looked like (3, 4, 5, 6), it should return one cluster.

The goal is to write code that returns the product codes of items for which multiple clusters were generated, so that I can audit those product codes for bad units of measure (case 100 vs case 10).

I'm thinking about using divisive hierarchical clustering, but I don't know how to introduce the centroid distance rule for creating new clusters.

I'm fairly new to R, but I have SQL and Stata experience, so I'm looking for a package that would do this or help with the syntax I need to accomplish this.

Please pop over to http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example to learn how to make a good reproducible post. 1) please add data as text and be sure to show your desired output. — emilliman5, May 25 '17 at 21:45

score 0 · Answer 1 · answered May 27 '17 at 15:28

Don't use clustering here.

While you can probably use hierarchical agglomerative clustering (HAC) with a ratio-like distance function and a threshold of 8x, this will be rather unreliable and expensive: clustering usually takes O(n²) or O(n³) time.
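
For reference, here is a minimal sketch of what HAC with a ratio-like distance could look like in base R. It assumes strictly positive prices, uses the absolute difference of log prices as the distance (so a distance of log(2) corresponds to a 2x price ratio), and picks average linkage, which is my assumption and only approximates the "ratio between centroids" rule from the question. The function name count_price_bands and its ratio argument are hypothetical.

```r
# Hypothetical helper: count price bands for one product.
# Assumes every price is > 0, because the distance is taken on the log scale.
count_price_bands <- function(price, ratio = 2) {
  if (length(price) < 2) return(1L)        # hclust() needs at least two points
  d <- dist(log(price))                    # |log(a) - log(b)| = log of the price ratio
  tree <- hclust(d, method = "average")    # linkage choice is an assumption
  bands <- cutree(tree, h = log(ratio))    # stop merging above the ratio threshold
  length(unique(bands))
}

count_price_bands(c(4, 5, 6, 13, 14, 15))  # 2
count_price_bands(c(3, 4, 5, 6))           # 1
```

Running this once per product code (for example with tapply) and keeping the codes where the result is greater than 1 would give the audit list, at the quadratic cost mentioned above.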

If you know that these errors happen, but not frequently, then I'd rather use a classic statistical approach. For example, compute the median and then report values that are 9x larger or smaller than the median as errors. If errors are infrequent enough, you could even use the mean, but the median is more robust.
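
A minimal sketch of the median approach in base R, under some assumptions: the data frame and its column names (product_code, price_per_each) are made up for illustration since the question's data was not posted as text, the function name flag_price_outliers is hypothetical, and all prices are assumed to be strictly positive.

```r
# Hypothetical example data; column names are assumptions, not from the question.
prices <- data.frame(
  product_code   = c("A", "A", "A", "B", "B", "B", "B"),
  price_per_each = c(2, 0.2, 3, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

# Return the rows whose price is at least `ratio` times larger or smaller
# than the median price of the same product. Assumes all prices are > 0.
flag_price_outliers <- function(df, ratio = 9) {
  # Per-product median, repeated so it lines up with the rows of df
  med <- ave(df$price_per_each, df$product_code, FUN = median)
  rel <- df$price_per_each / med
  # Fold the ratio so "x times smaller" and "x times larger" are treated alike
  fold <- pmax(rel, 1 / rel)
  df[fold >= ratio, , drop = FALSE]
}

suspect <- flag_price_outliers(prices, ratio = 9)
unique(suspect$product_code)  # product codes to audit
```

With this made-up data only product "A" is reported, because its 0.2 price is 10 times smaller than the median of 2. Swapping median for mean in the ave() call gives the less robust variant mentioned above.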
