
I have a collection of alphanumeric product codes for various products. Similar products have no intrinsic similarity in their codes, i.e. product code "A123" might mean "Harry Potter Volume 1 DVD" and "B123" might mean "Kellogg's Corn Flakes". I also do not actually have the description or identity of the product; all I have is an "owner" of each code. My data, therefore, looks (in a non-normalised way) something like this:

Owner1: ProductCodes A123,B124,W555,M221,M556,127,102

Owner2: ProductCodes D103,Z552,K112,L3254,223,112

Owner3: ProductCodes G123

....

I have huge (i.e. terabyte-scale) sets of this data.

I assume that an owner would, for the most part, have an undetermined number of groups of similar products. For example, an owner might have just two groups: all the Harry Potter DVDs and books, plus a collection of Iron Maiden CDs. I would like to analyse this data and derive distance functions between product codes, so I can start making assumptions about "how close" product codes are to each other, and also cluster product codes (which would let me identify how many groups an owner has). I have started doing some research on textual clustering algorithms, but there are numerous ones to choose from and I'm not sure which one(s) work best for this scenario.

Can someone point me towards the most appropriate Python-based clustering functions/libraries to use, please?

Richard Green
  • I will dig through the data later tonight and update this with some stats. Any more questions?! – Richard Green Mar 07 '11 at 15:17
  • Roughly how many owners and how many different ProductCodes are there, and what is the average number of codes per owner? If one could generate random data of about the right size and correlation, various people might try running ... on it. – denis Mar 08 '11 at 09:56

6 Answers


What you have is a bipartite graph. As an initial stab, it sounds like you are going to treat neighbour lists as zero-one vectors between which you define some kind of similarity/correlation; this could be a normalised Hamming distance, for example. Depending on which way you do that, you will obtain a graph on a single domain: either product codes or owners. It will shortly become clear why I've cast everything in the language of graphs, so bear with me.

Now, why do you insist on a Python implementation? Clustering large-scale data is time- and memory-consuming. To let the cat out of the bag: I have written, and still maintain, a graph clustering algorithm that is used quite widely in bioinformatics. It is threaded, accepts weighted graphs, and has been used for graphs with millions of nodes and towards a billion edges. Refer to http://micans.org/mcl/ for more information.

Of course, if you trawl Stack Overflow and Stack Exchange there are quite a few threads that may be of interest to you. I would recommend the Louvain method as well, except that I am not sure whether it accepts weighted networks, which you will probably produce.
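As a concrete illustration of the projection described above, here is a minimal sketch that turns owner lines into a weighted product-code graph, using Jaccard overlap of owner sets as one possible similarity, and writes it in the three-column label format that mcl accepts via its --abc option. The input file name, the 0.2 pruning threshold, and the brute-force all-pairs loop are illustrative assumptions only; at terabyte scale you would need the pruning and neighbour-search ideas mentioned later in this thread.

    from collections import defaultdict
    from itertools import combinations

    # owners_of[code] = set of owners holding that product code
    owners_of = defaultdict(set)
    with open("owners.txt") as f:       # hypothetical input file, one
        for line in f:                  # "OwnerX: ProductCodes a,b,c" per line
            owner, _, rest = line.partition(":")
            for code in rest.replace("ProductCodes", "").split(","):
                code = code.strip()
                if code:
                    owners_of[code].add(owner.strip())

    # One weighted edge per sufficiently similar pair of codes, in mcl's
    # "label" (ABC) input format: source<TAB>target<TAB>weight.
    with open("codes.abc", "w") as out:
        for a, b in combinations(owners_of, 2):
            sa, sb = owners_of[a], owners_of[b]
            w = len(sa & sb) / float(len(sa | sb))   # Jaccard similarity
            if w > 0.2:                              # arbitrary cutoff
                out.write("%s\t%s\t%.4f\n" % (a, b, w))

    # then, on the command line:  mcl codes.abc --abc -I 2.0 -o clusters.txt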

micans
  • The more I look at the answers I get regarding this, the more I think that Python is just a preferred option and not necessarily essential. You are correct in the fact that it is zero-one vectors. I will take a look at your references to see if I can use them. Thanks for the response. – Richard Green Mar 08 '11 at 09:11
  • I forgot to say, some of those other threads/questions/topics refer to 'community detection' in social networks. This is exactly the same as cluster analysis in graphs. In contrast to the deluge of published clustering algorithms, not many are available in software, especially if scalability and also (reasonable) licensing are important. This should make it easier to focus your search. – micans Mar 08 '11 at 11:30
  • Your question has received quite some attention in the scientific literature as well. Searching for 'clustering of product baskets' led me, for example, to 'A scalable approach to balanced, high dimensional clustering of market-baskets' by Strehl and Ghosh. This article, and the ones that cite it (http://scholar.google.co.uk/scholar?cites=13994245549005683952&as_sdt=2005&sciodt=0,5&hl=en), might also be useful in your quest. – micans Mar 08 '11 at 11:52
  • Great comments - thanks. At least now I can start using the right terminology - i.e. "market-baskets" and also "affinity correlation" - which is basically the thing I am looking to measure. I have just downloaded "data miner" so I am hoping I might just use that for the meanwhile (if it copes). Keep the pointers coming though, please! – Richard Green Mar 09 '11 at 12:50
  • As I think about it more, other interesting aspects come to the fore. This paper and related papers could be pertinent: "Toward the Next Generation of Recommender Systems: A Survey of the State-of-the-Art and Possible Extensions" by Adomavicius and Tuzhilin. Then, how about transforming your data to a graph using a k-nearest-neighbour search? With your scale of data you want to avoid doing all pairwise comparisons - this is a whole topic of research in itself; see e.g. efficient 'k' or 'generalized' nearest neighbour search in high dimensional spaces, k-d trees. – micans Mar 10 '11 at 11:19

The R language has many packages for finding groups in data, and there are Python bindings to R, called RPy. R provides several of the algorithms already mentioned here and is also known for good performance on large datasets.
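As a minimal sketch of that route, here is R's built-in kmeans() called from Python via rpy2 (the maintained successor to RPy); the toy data and the choice of two clusters are made up:

    import rpy2.robjects as robjects

    r = robjects.r
    # four 2-D points forming two obvious groups
    data = robjects.FloatVector([0.0, 0.1, 5.0, 5.1,   # x values
                                 0.0, 0.1, 5.0, 5.2])  # y values
    m = r.matrix(data, ncol=2)          # R fills matrices column-wise

    result = r.kmeans(m, centers=2)     # R's stats::kmeans
    print(result.rx2("cluster"))        # cluster assignment per point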

eGlyph

I think you can use Pycluster, and you can change the algorithm to suit your problem.

Also, I think you should take a look at this: http://www.dennogumi.org/2007/11/data-clustering-with-python
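For example, a minimal Pycluster sketch clustering the columns (product codes) of a toy 0/1 owner-by-product matrix could look like this; the data and the choice of two clusters are made up:

    import numpy as np
    import Pycluster

    # rows = owners, columns = product codes (1 = owner has that code)
    data = np.array([[1, 1, 0, 0],
                     [1, 1, 1, 0],
                     [0, 0, 1, 1],
                     [0, 1, 1, 1]], dtype=float)

    # transpose=1 clusters the columns (product codes) rather than rows;
    # npass repeats k-means from random starts and keeps the best run.
    clusterid, error, nfound = Pycluster.kcluster(
        data, nclusters=2, transpose=1, npass=10, method='a', dist='e')
    print(clusterid)   # a cluster label for each product code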

Mohammad Efazati

I don't know much about your problem domain, but PyCluster is a pretty decent clustering package which works well on large datasets: http://bonsai.hgc.jp/~mdehoon/software/cluster/software.htm

Hope it helps.

memimo
  • I have started looking at a number of the "easily google-able" Python libs out there, but I'm just not sure which one fits this domain best. The biggest issue I have is that an owner may have multiple groups of products and those groups are not necessarily correlated, so not only do I need to cluster over the whole dataset, but I need to cluster (potentially) within each owner as well :-/ – Richard Green Feb 15 '11 at 14:13

I don't know of an off-the-shelf lib, sorry. There are big libs for full-text search and similarity, but for bit sets you'll have to roll your own (as far as I know). A couple of suggestions anyway:

  • bitset approach: first get, say, 10k owners x 100k products, or 100k x 10k, in memory to play with. You could use bitarray to make a big array of 10k x 100k bits (see the sketch after this list). But then, what do you want to do with it?
    To find similar pairs among N objects (either owners or products), you have to look at all N*(N-1)/2 pairs, which is a lot;
    or, there must be some structure in the data that allows early pruning / hierarchical similarity;
    or, google "greedy clustering" Python; I don't see an off-the-shelf lib.

  • how do you define "similarity" of owners / of products? There are lots of possibilities: number in common, ratio in common, tf-idf ...
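A minimal sketch of the bitset approach from the first bullet, using the bitarray package with Jaccard as the similarity; the sizes and helper names are illustrative assumptions only:

    from bitarray import bitarray

    N_OWNERS = 10000
    owner_index = {}   # owner name -> bit position
    bits_of = {}       # product code -> bitarray over owners

    def mark(code, owner):
        """Record that `owner` holds `code` by setting one bit."""
        idx = owner_index.setdefault(owner, len(owner_index))
        if code not in bits_of:
            b = bitarray(N_OWNERS)
            b.setall(False)
            bits_of[code] = b
        bits_of[code][idx] = True

    def similarity(code_a, code_b):
        """Jaccard on owner bitsets: |A & B| / |A | B|."""
        a, b = bits_of[code_a], bits_of[code_b]
        union = (a | b).count()
        return (a & b).count() / float(union) if union else 0.0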

(Added): have you looked at Mahout's recommendation system API? Is that about what you're looking for?
This SO question says there's no Python equivalent, which leaves two choices:
a) ask if anyone has used Mahout from Jython, or b) if you can't lick 'em, join 'em.

denis
  • Have just read the "added" section - Mahout could fit the bill. I would have really preferred not to include yet another group of technologies in the mix, but if it does what I need it to do, then I'll bite the proverbial bullet. – Richard Green Mar 07 '11 at 15:32

You can try clustering with the k-means algorithm; a SciPy-based implementation is available as scikits.learn.cluster.KMeans.
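A minimal sketch (the package has since been renamed, so the import below uses the modern sklearn name rather than scikits.learn; the toy matrix and the choice of two clusters are made up):

    import numpy as np
    from sklearn.cluster import KMeans   # formerly scikits.learn.cluster

    # rows = owners, columns = product codes (1 = owner has that code)
    X = np.array([[1, 1, 0, 0],
                  [1, 1, 1, 0],
                  [0, 0, 1, 1],
                  [0, 1, 1, 1]], dtype=float)

    # transpose so that product codes become the samples being clustered
    km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X.T)
    print(km.labels_)   # a cluster label for each product code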

Andrea Spadaccini
  • Andrea, have you used this on large sparse bit sets with a custom metric? It's easy to modify any 1-page kmeans.py, but how big and with what metric would be good to know. – denis Feb 18 '11 at 17:13
  • @Denis, no I did not use it, I used other parts of scikits.learn, so I cannot give direct feedback on it. – Andrea Spadaccini Feb 18 '11 at 17:16