
I have a table that contains widget orders for multiple departments, with each department represented by its buyer. The table structure looks like this:

+--------------------------+------------+------------+------------+
|         order_id         | order_date | dept_buyer | widget_mfg |
+--------------------------+------------+------------+------------+
| 56991ba89468d0fc1d53781d | 2/16/2015  | Gutierrez  | OTHERSIDE  |
| 56991ba8f020fc065e5b7219 | 11/14/2014 | Moreno     | QUALITEX   |
| 56991ba82340ecb7b2e9dda8 | 1/15/2015  | Gutierrez  | PROGENEX   |
| 56991ba87bacb0ee3161fd61 | 2/4/2015   | Glover     | ULTRASURE  |
| 56991ba8ade6acae3307a3e9 | 4/20/2015  | Hancock    | WEBIOTIC   |
| 56991ba80b404bcc73094e66 | 4/3/2014   | Castro     | PROGENEX   |
| 56991ba8cb37eda5e5557a74 | 7/21/2014  | Moreno     | OTHERSIDE  |
+--------------------------+------------+------------+------------+

Each row represents a single widget order, as widgets are generally ordered individually. The actual table has tens of thousands of rows representing ~3 years of orders. There are ~100 department buyers and ~1000 widget manufacturers.

I want to provide department buyers an order form that contains their most commonly ordered widgets for easier purchasing. From prior experience, I know that many department buyers order similar widgets. That is, department buyers can be clustered together by their widget buying behavior. For this reason, as well as for maintenance purposes, I would like to create as few forms as possible while still capturing the most commonly ordered widgets for the department buyers that will use the form.

This seems like a machine learning clustering problem to me, but I am not familiar enough with the subject area to get a foothold on the problem. Is there an established algorithm or library for tackling a problem like this one?
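For concreteness, one way to represent "widget buying behavior" is a per-buyer count of orders by manufacturer. The following is a minimal sketch using only the Python standard library. The function names `buyer_profiles` and `top_widgets` are illustrative (they do not come from any library), and the rows are just the sample rows from the table above, not the full data set.

```python
from collections import Counter, defaultdict

# Sample rows copied from the table above:
# (order_id, order_date, dept_buyer, widget_mfg)
orders = [
    ("56991ba89468d0fc1d53781d", "2/16/2015", "Gutierrez", "OTHERSIDE"),
    ("56991ba8f020fc065e5b7219", "11/14/2014", "Moreno", "QUALITEX"),
    ("56991ba82340ecb7b2e9dda8", "1/15/2015", "Gutierrez", "PROGENEX"),
    ("56991ba87bacb0ee3161fd61", "2/4/2015", "Glover", "ULTRASURE"),
    ("56991ba8ade6acae3307a3e9", "4/20/2015", "Hancock", "WEBIOTIC"),
    ("56991ba80b404bcc73094e66", "4/3/2014", "Castro", "PROGENEX"),
    ("56991ba8cb37eda5e5557a74", "7/21/2014", "Moreno", "OTHERSIDE"),
]


def buyer_profiles(rows):
    """Map each buyer to a Counter of orders per widget manufacturer."""
    profiles = defaultdict(Counter)
    for _order_id, _order_date, buyer, mfg in rows:
        profiles[buyer][mfg] += 1
    return dict(profiles)


def top_widgets(profile, n):
    """Return the n most frequently ordered manufacturers in one profile."""
    return [mfg for mfg, _count in profile.most_common(n)]


profiles = buyer_profiles(orders)
for buyer, profile in sorted(profiles.items()):
    print(buyer, dict(profile))
```

Each profile is a sparse vector over the ~1000 manufacturers, so buyers with similar profiles are the ones that could plausibly share a form.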

Amir
Tony
  • You could start with [K-Means Clustering](http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html). – Michael Recachinas Jan 15 '16 at 16:58
  • @MichaelRecachinas my understanding with K-means clustering is that I need to tell it the number of clusters I want, which I don't know. Rather, I want to optimize the amount of similarity between stock forms, regardless of the number of clusters that yields. – Tony Jan 15 '16 at 17:24
  • Understood. It sounds like you're still looking for a clustering algorithm though. Even without knowing "K", you can still use K-Means. You'll just want to use cross validation to optimize the number of clusters and such: http://stackoverflow.com/questions/6615665/kmeans-without-knowing-the-number-of-clusters – Michael Recachinas Jan 15 '16 at 17:35
  • Not to mention, from [this post](http://stackoverflow.com/questions/33685321/clustering-for-categorical-and-numerical-data), it appears that K-means only works for numerical (continuous) data. The dimension that I'm trying to cluster on (widget_mfg) is categorical. – Tony Jan 15 '16 at 18:33
  • Good point -- there are two variations of K-means that operate on categorical data: K-modes (https://github.com/nicodv/kmodes) and K-medoids (http://scialert.net/fulltext/?doi=jai.2013.257.265&org=11) that you may be interested in. The same aforementioned consideration applies -- you can use cross validation to determine the optimal "K". – Michael Recachinas Jan 15 '16 at 19:10
  • @Tony To me your problem sounds more like a recommender task: if customer A bought widget X then you'll recommend him widgets most frequently bought by other people who also bought X, don't you think so? – Sergey Bushmanov Jan 16 '16 at 05:39
  • @MichaelRecachinas There is no way to find distance measures for device names [without additional info] to do K-means on this data set – Sergey Bushmanov Jan 16 '16 at 05:45
  • @SergeyBushmanov -- you're correct that distance measures wouldn't work as with K-Means, which is the point of K-Modes -- i.e., K-Modes replaces the Euclidean distance measure with a user-defined similarity function. OP would have to provide some basis for calculating similarity (which requires looking a bit into the data to determine what could be leveraged to determine similarity). – Michael Recachinas Jan 16 '16 at 21:25
  • @SergeyBushmanov -- if, however, OP wants to build a system that, based on shopping preferences of previous buyers, *recommends* a new widget, then yeah a recommender system would be well-suited. – Michael Recachinas Jan 16 '16 at 21:26
  • @SergeyBushmanov I'm not really looking for a recommendation engine, I want to proactively cluster the buyers by their widget buying behavior. – Tony Jan 19 '16 at 17:56
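
One possible way to address the "I don't know K" concern raised in the comments is agglomerative clustering with a distance cutoff, where the number of clusters falls out of the cutoff rather than being specified up front. This is a swapped-in technique, not something proposed in the thread. The sketch below uses only the Python standard library and rests on several assumptions: each buyer is summarized as a count of orders per manufacturer, similarity is cosine similarity between those counts, and the buyer names, counts, and the `max_distance` value are made up for illustration. The function names `cosine_distance`, `cluster_buyers`, and `form_for` are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two Counters of order counts."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = sqrt(sum(v * v for v in a.values()))
    norm_b = sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - dot / (norm_a * norm_b)


def cluster_buyers(profiles, max_distance):
    """Average-linkage agglomerative clustering.

    Repeatedly merges the two closest clusters and stops once the closest
    pair is farther apart than max_distance, so K is never given directly.
    """
    clusters = [[buyer] for buyer in sorted(profiles)]

    def linkage(c1, c2):
        # Mean pairwise distance between members of the two clusters.
        pairs = [(x, y) for x in c1 for y in c2]
        total = sum(cosine_distance(profiles[x], profiles[y]) for x, y in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        (i, j), dist = min(
            (((i, j), linkage(clusters[i], clusters[j]))
             for i, j in combinations(range(len(clusters)), 2)),
            key=lambda item: item[1],
        )
        if dist > max_distance:
            break
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters


def form_for(cluster, profiles, n):
    """Return the n most ordered manufacturers across a cluster's buyers."""
    total = Counter()
    for buyer in cluster:
        total.update(profiles[buyer])
    return [mfg for mfg, _count in total.most_common(n)]


# Made-up profiles for illustration only (not taken from the real table).
example_profiles = {
    "buyer_a": Counter({"PROGENEX": 5, "OTHERSIDE": 3}),
    "buyer_b": Counter({"PROGENEX": 4, "OTHERSIDE": 2}),
    "buyer_c": Counter({"ULTRASURE": 6}),
    "buyer_d": Counter({"ULTRASURE": 3, "WEBIOTIC": 1}),
}

example_clusters = cluster_buyers(example_profiles, max_distance=0.5)
for cluster in example_clusters:
    print(cluster, form_for(cluster, example_profiles, n=2))
```

The cutoff trades off the two goals in the question: a larger `max_distance` yields fewer forms that fit each buyer less closely, and a smaller one yields more forms that fit more closely. This pairwise implementation is only meant to show the idea; with ~100 buyers a library implementation of hierarchical clustering would likely be a better fit.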
