I have a collection of alphanumeric product codes for various products. Similar products have no intrinsic similarity in their codes, i.e. product code "A123" might mean "Harry Potter Volume 1 DVD" and "B123" might mean "Kellogg's Corn Flakes". I also do not have the description or identity of each product. All I have is the "owner" of each code. My data therefore looks (in non-normalised form) something like this:
Owner1: ProductCodes A123,B124,W555,M221,M556,127,102
Owner2: ProductCodes D103,Z552,K112,L3254,223,112
Owner3: ProductCode G123
....
I have huge (i.e. terabytes) sets of this data.
I assume that most owners have an unknown number of groups of similar products. For example, an owner might have just two groups: all the Harry Potter DVDs and books, and a collection of "Iron Maiden" CDs. I would like to analyse this data and derive distance functions between product codes, so that I can estimate how "close" product codes are to each other and cluster them (which would also tell me how many groups an owner has). I have started researching textual clustering algorithms, but there are many to choose from and I'm not sure which would work best in this scenario.
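To make the goal concrete, here is a minimal sketch of the kind of distance I have in mind. It assumes that two codes are "close" when they tend to be held by the same owners, and uses a Jaccard distance over owner sets. The toy data, the `invert` helper and the `jaccard_distance` function are hypothetical illustrations (the codes are made to overlap between owners so the result is non-trivial), not something I already have working at scale:

```python
from collections import defaultdict

# Hypothetical toy data in the same shape as the sample above
# (owner -> set of product codes), with deliberate overlap between owners.
owners = {
    "Owner1": {"A123", "B124", "W555"},
    "Owner2": {"A123", "B124", "Z552"},
    "Owner3": {"G123"},
}

def invert(owner_codes):
    """Map each product code to the set of owners that hold it."""
    code_owners = defaultdict(set)
    for owner, codes in owner_codes.items():
        for code in codes:
            code_owners[code].add(owner)
    return code_owners

def jaccard_distance(a, b, code_owners):
    """Return 1 - |owners(a) & owners(b)| / |owners(a) | owners(b)|."""
    owners_a, owners_b = code_owners[a], code_owners[b]
    union = owners_a | owners_b
    if not union:
        # Neither code is held by anyone: treat them as maximally distant.
        return 1.0
    return 1.0 - len(owners_a & owners_b) / len(union)

code_owners = invert(owners)
print(jaccard_distance("A123", "B124", code_owners))  # 0.0 (always co-owned)
print(jaccard_distance("A123", "W555", code_owners))  # 0.5
print(jaccard_distance("A123", "G123", code_owners))  # 1.0 (never co-owned)
```

This obviously compares codes one pair at a time and keeps everything in memory, so it only illustrates the idea; it would not cope with terabytes of data as written.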
Can someone point me towards the most appropriate Python-based clustering functions or libraries to use, please?