I'm trying to use clustering to identify bands in my data set. I'm working with supply chain data, so my data looks like this:

[Image: sample of the data set]

The relevant column is the price per Each.

The problem is that a product is sometimes incorrectly recorded as coming in a Case of 100 instead of 10, so the Price per Each would look like (2, 0.25, 3). I want to write code that only creates a new cluster if its mean price is at least 2 times greater or smaller than the mean of every existing cluster.

For example, if my prices per each were (4, 5, 6, 13, 14, 15), I want it to return 2 clusters with centroids of 5 and 14. If, on the other hand, my data looked like (3, 4, 5, 6), it should return one cluster.

The goal is to write code that returns the product codes for which more than one cluster was generated, so that I can audit those product codes for bad units of measure (case 100 vs case 10).

I'm thinking about using divisive hierarchical clustering, but I don't know how to introduce the centroid distance rule for creating new clusters.

I'm fairly new to R, but I have SQL and Stata experience, so I'm looking for a package that would do this or help with the syntax I need to accomplish this.

C. Phil
  • Please pop over to http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example to learn how to make a good reproducible post. 1) please add data as text and be sure to show your desired output. – emilliman5 May 25 '17 at 21:45

1 Answer

Don't use clustering here.

While you can probably use HAC (hierarchical agglomerative clustering) with a ratio-like distance function and a threshold of 2x, this will be rather unreliable and expensive: clustering will usually take O(n²) or O(n³).
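For reference, the HAC variant could be sketched in base R as follows. This is only an illustration under stated assumptions: the function name `count_price_bands` is hypothetical, prices are taken on a log scale so that a distance corresponds to a price ratio, and average linkage is used as a stand-in for the "ratio between cluster means" rule from the question (it approximates that rule, it does not implement it exactly). The factor of 2 comes from the question.

```r
# Hypothetical helper: count price bands for ONE product's prices.
# Assumption: |log(a) - log(b)| = log(ratio), so cutting the tree at
# log(ratio) separates groups whose average price ratio exceeds `ratio`.
count_price_bands <- function(prices, ratio = 2) {
  prices <- prices[!is.na(prices) & prices > 0]  # log() needs positive values
  if (length(prices) < 2) return(length(prices)) # hclust() needs >= 2 points
  hc <- hclust(dist(log(prices)), method = "average")
  length(unique(cutree(hc, h = log(ratio))))
}

bands_two <- count_price_bands(c(4, 5, 6, 13, 14, 15))
bands_one <- count_price_bands(c(3, 4, 5, 6))
```

Applied per product code, any product with more than one band would be a candidate for auditing, but the cost and fragility noted above still apply.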

If you know that these errors happen, but not frequently, then I'd rather use a classic statistical approach. For example, compute the median and then report values that are more than 2x larger or smaller than the median as errors. If errors are infrequent enough, you could even use the mean, but the median is more robust.
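A minimal sketch of the median rule in base R might look like this. The column names `product_code` and `price_per_each`, the function name `flag_price_outliers`, and the sample data are assumptions made for illustration, since the original data was only posted as an image. The default factor of 2 is the one stated in the question.

```r
# Hypothetical helper: flag prices that differ from their product's median
# by at least a factor of `ratio` in either direction.
flag_price_outliers <- function(df, ratio = 2) {
  # ave() returns the per-product median aligned with each row
  med <- ave(df$price_per_each, df$product_code, FUN = median)
  r <- df$price_per_each / med
  df$suspect <- r >= ratio | r <= 1 / ratio
  df
}

# Made-up sample data: product "A" contains one mis-keyed case size
prices <- data.frame(
  product_code   = c("A", "A", "A", "B", "B", "B", "B"),
  price_per_each = c(2, 0.25, 3, 3, 4, 5, 6),
  stringsAsFactors = FALSE
)

flagged <- flag_price_outliers(prices)
codes_to_audit <- unique(flagged$product_code[flagged$suspect])
```

Here `codes_to_audit` holds the product codes with at least one suspect price. Swapping `median` for `mean` in the `ave()` call gives the mean-based variant, at the cost of robustness.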

Has QUIT--Anony-Mousse