How do you cluster items based on observations of which items occur together?
I have a problem like the following. Say I'm studying children's games and want to know which games tend to be played together. I've identified 12 games, and gone to the playground 100 breaktimes in a row and observed which games are being played by the children each time. My observations are effectively boolean: 1 if the game is played that time, 0 otherwise, as in the following dataframe (actually random numbers don't work very well here, but they show the type of the data I'm working with).
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
np.random.seed(0)  # seed NumPy's generator, which is the one np.random.rand draws from
games=['Game %d'%i for i in range(0,12)]
observations=pd.DataFrame((np.random.rand(100, 12)*0.55).round(), columns=games)
I want to know which games tend to be played together.
The approach I've used so far is to create a matrix counting, for each pair of games, how many observations agree (both played or both not played at that breaktime), and create a 'distances apart' matrix from it:
observedTogether=np.array([[(observations[g1]==observations[g2]).sum()
for g2 in games] for g1 in games])
distancesMatrix=1-observedTogether/observedTogether.max()
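To make clear what that matrix counts, here is a small sketch that separates "both games played" from "both columns equal". The names `obs`, `both_played` and `agree` are made up for illustration, and the data is a random stand-in rather than my real observations.

```python
import numpy as np

# Hypothetical stand-in data: 100 breaktimes x 12 games, 1 = played, 0 = not played
rng = np.random.default_rng(0)
obs = (rng.random((100, 12)) < 0.3).astype(int)

# Number of breaktimes in which both games of a pair were played
both_played = obs.T @ obs

# Number of breaktimes in which the two columns hold the same value
# (both 1 or both 0), which is what the == comparison above counts
agree = both_played + (1 - obs).T @ (1 - obs)
```

Either count could be turned into distances the same way; the rest of this question uses the agreement-based `distancesMatrix` defined above.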
Then use this as input to a clustering algorithm:
from scipy.cluster.hierarchy import dendrogram, linkage
Z = linkage(distancesMatrix, 'single')
dendrogram(Z, orientation='right', labels=games)
plt.show()
Which gives output like the following:
Trouble is, the resulting clustering doesn't look very convincing with the real data, and that warning message makes me suspect that I'm doing something wrong. Googling the message suggests that other people are also doing things wrong, but I can't find an explanation of what it means. The message is undeniably correct: the matrix is a distance matrix.
What should I be doing instead?
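One thing that may be relevant: as far as I can tell from the scipy documentation, `linkage` accepts either a condensed (1-D) distance matrix or a 2-D array of observation vectors, so a square matrix would be read as 12 observations with 12 features each. A minimal sketch of converting to condensed form first with `scipy.spatial.distance.squareform`, using random stand-in data and illustrative names (`obs`, `agree`, `dist`, `condensed`), would be:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

# Hypothetical stand-in data: 100 breaktimes x 12 games, 1 = played, 0 = not played
rng = np.random.default_rng(0)
obs = (rng.random((100, 12)) < 0.3).astype(int)

# Same construction as in the question: agreement counts scaled into [0, 1]
agree = np.array([[(obs[:, i] == obs[:, j]).sum() for j in range(12)]
                  for i in range(12)])
dist = 1 - agree / agree.max()

# squareform turns the symmetric, zero-diagonal square matrix into the
# condensed 1-D form (upper triangle only)
condensed = squareform(dist)

# With 1-D input, linkage treats the values as pairwise distances
Z = linkage(condensed, 'single')
```

I don't know whether this alone makes the clustering convincing on the real data, or whether the distance definition itself is the bigger problem.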