I have a document d1 consisting of lines of form user_id tag_id. There is another document d2 consisting of tag_id tag_name I need to generate clusters of users with similar tagging behaviour. I want to try this with k-means algorithm in python. I am completely new to this and cant figure out how to start on this. Can anyone give any pointers?
Do I need to first create different documents for each user using d1 with his tag vocabulary? And then apply k-means algorithm on these documents? There are like 1 million users in d1. I am not sure I am thinking in right direction, creating 1 million files ?