I am new to igraph and social network analysis, but not to R.

I am struggling to correctly structure my dataset for community detection, but have successfully used igraph to generate a co-occurrence matrix as directed [here]. What I would like to do next is run a community detection algorithm on the same dataset to create a graph showing clusters, as is done in the answer here.

The sample code for how to do this is as follows:

df1 <- graph.famous("Zachary")
df2 <- walktrap.community(df1) # any algorithm
plot(df2, df1) # draw the communities on top of the graph

I've been poking around on the web to find out the structure of the Zachary dataset so I can correctly model my data, but am struggling to find my way through the technical documentation.

My data is currently structured in long form, such that:

id          interest   comments
1             Comedy          2
1  Music: Electronic         11
1       Video Gaming         10
1         Music: Pop          1
1      Entertainment          1
1       Video Gaming          4
2       Video Gaming         45
2      Entertainment         26
2         Music: Pop          1
2             Comedy         14
3       Video Gaming         10
3      Entertainment          4
3             Comedy          8
4       Video Gaming          9
4  Music: Electronic          1
4         Music: Pop          2
5         Music: Pop          2
5      Entertainment          1
5       Video Gaming          1
6       Video Gaming         12

I am trying to find clusters of overlapping interest in the population I am studying, so the ID is a person, the interests are the person's interests, and comments is an index of how many times they have shown interest. Does this help?

I've tried to run the community algorithms on this dataset (e.g. df2 <- walktrap.community(df)) but that doesn't seem to work correctly. Thoughts on what this n00b is doing wrong?

roody
  • Graphs are made up of nodes and edges. What in your data set relates to nodes, and how do you know which nodes are connected? The Zachary data can be represented either by an adjacency matrix (`get.adjacency(df1)`) or by an edge list (`get.edgelist(df1)`) – MrFlick May 22 '15 at 17:20
  • @MrFlick The nodes are the values in the column "interests" -- i.e. comedy, gaming, etc. I am trying to find clusters of overlapping interest in the population I am studying, so the ID is a person, the interests are the person's interests, and "comments" is an index of how many times they have shown interest. Does this help? – roody May 22 '15 at 17:32
  • That makes sense, but it still doesn't translate well into the language of graphs. So you want each interest to be a node, and then you want an edge between nodes if a user shares both interests? I'm not sure what you'd do with the comments column since that doesn't seem to fit well as a node or edge weight. I'm still having a hard time seeing how this data would be represented in graph form. – MrFlick May 22 '15 at 17:46
  • @MrFlick I may not have communicated accurately... my end goal is to find unique clusters of people who have combinations of interests. So, there might be a cluster of people who like "gaming and entertainment", which might be related to (but distinct from) people who like "gaming and comedy". The "comments" column was meant to provide a relative weight of interest (and was used in the co-occurrence analysis), but I'm really just trying to muddle my way through a suggested approach to the clustering... – roody May 22 '15 at 17:52
  • @roody if you feel like my answer was helpful or answered your question, please upvote/accept it. Otherwise, please don't hesitate to comment for clarifications. – Antoine Aug 10 '15 at 16:24

1 Answer

A graph built this way won't let you cluster individuals; it only shows which variables (here, interests) are related. Still, if you want to build a graph from your data, here is what you would have to do. (Note that I have saved your sample data as a .csv file and uploaded it to Dropbox to make a readily reproducible example.)

library(repmis)
library(igraph)

test=source_data("https://www.dropbox.com/s/bochkedd4o3gzvq/so.csv?dl=0")

First, what you want is to create a matrix with one row per individual and one column per feature:

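matrix=matrix(0,nrow=length(unique(test[,1])),ncol=length(unique(test[,2]))) # one row per individual, one column per interest
rownames(matrix)=unique(test[,1])
colnames(matrix)=unique(test[,2])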

Then fill each cell with the strength of that individual's interest (the summed comment counts):

for (i in 1:nrow(matrix)){
    # rows of test belonging to the i-th individual (columns kept: interest, comments)
    temp=test[test[,1]==rownames(matrix)[i],2:3]
    for (j in 1:ncol(matrix)){
        matrix[i,j]=sum(temp[temp[,1]==colnames(matrix)[j],2]) # sum is used because of duplicates
    }
}
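As an alternative to the double loop, the same person-by-interest matrix can probably be built in one step with base R's `xtabs()`. This is only a sketch: the data frame below is the question's sample data typed in by hand, and the names `test` and `m` are chosen for illustration. Note that `xtabs()` orders the columns alphabetically, so the column order differs from the loop version.

```r
# Sample data from the question, typed in by hand (long format).
test <- data.frame(
  id = c(1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 4, 4, 4, 5, 5, 5, 6),
  interest = c("Comedy", "Music: Electronic", "Video Gaming", "Music: Pop",
               "Entertainment", "Video Gaming",
               "Video Gaming", "Entertainment", "Music: Pop", "Comedy",
               "Video Gaming", "Entertainment", "Comedy",
               "Video Gaming", "Music: Electronic", "Music: Pop",
               "Music: Pop", "Entertainment", "Video Gaming",
               "Video Gaming"),
  comments = c(2, 11, 10, 1, 1, 4, 45, 26, 1, 14,
               10, 4, 8, 9, 1, 2, 2, 1, 1, 12),
  stringsAsFactors = FALSE
)

# xtabs() sums `comments` within each (id, interest) cell, so the duplicate
# "Video Gaming" rows for person 1 are added together (10 + 4 = 14).
tab <- xtabs(comments ~ id + interest, data = test)

# Drop the xtabs/table class to get a plain numeric matrix.
m <- unclass(tab)
attr(m, "call") <- NULL

m
```

Combinations that never occur in the data (for example person 6 and "Comedy") come out as 0, which matches what the loop produces.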

What you get is:

> matrix
  comedy electronic gaming pop ent
1      2         11     14   1   1
2     14          0     45   1  26
3      8          0     10   0   4
4      0          1      9   2   0
5      0          0      1   2   1
6      0          0     12   0   0

Then, from that, you can create an adjacency matrix:

x=t(matrix)%*%matrix

And what you get is:

> x
           comedy electronic gaming pop  ent
comedy        264         22    738  16  398
electronic     22        122    163  13   11
gaming        738        163   2547  79 1225
pop            16         13     79  10   29
ent           398         11   1225  29  694
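To see where these numbers come from: each entry of the cross-product is the sum, over all individuals, of the product of their comment counts for the two interests. The sketch below re-enters the matrix printed above by hand (under the name `m`, chosen here for illustration) and recomputes two entries directly.

```r
# Person-by-interest matrix, copied from the output above.
m <- matrix(
  c( 2, 11, 14, 1,  1,
    14,  0, 45, 1, 26,
     8,  0, 10, 0,  4,
     0,  1,  9, 2,  0,
     0,  0,  1, 2,  1,
     0,  0, 12, 0,  0),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(1:6),
                  c("comedy", "electronic", "gaming", "pop", "ent"))
)

x <- t(m) %*% m   # equivalent to crossprod(m)

# Off-diagonal entry: sum over individuals of (comments on A) * (comments on B).
comedy_gaming <- sum(m[, "comedy"] * m[, "gaming"])   # 2*14 + 14*45 + 8*10 = 738

# Diagonal entry: sum of squared comment counts for a single interest.
comedy_comedy <- sum(m[, "comedy"]^2)                 # 2^2 + 14^2 + 8^2 = 264
```

So a pair of interests gets a large weight when the same individuals comment heavily on both, which also means very active individuals contribute disproportionately.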

Building a graph from that is easy:

g=graph.adjacency(x,weighted=TRUE,mode="undirected",diag=FALSE) # diag=FALSE drops self-loops
g=simplify(g)
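A quick way to check that the graph looks as expected is to count its vertices and edges and list the edge weights. This is a sketch that assumes a recent igraph, where `graph.adjacency()` goes by the name `graph_from_adjacency_matrix()`; the matrix `x` is re-entered by hand from the output above so the snippet runs on its own.

```r
library(igraph)

nm <- c("comedy", "electronic", "gaming", "pop", "ent")
x <- matrix(
  c(264,  22,  738, 16,  398,
     22, 122,  163, 13,   11,
    738, 163, 2547, 79, 1225,
     16,  13,   79, 10,   29,
    398,  11, 1225, 29,  694),
  nrow = 5, byrow = TRUE, dimnames = list(nm, nm)
)

g <- graph_from_adjacency_matrix(x, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

n_vertices <- vcount(g)   # one vertex per interest
n_edges    <- ecount(g)   # one edge per pair of interests with a non-zero weight

# Edge list with weights, strongest co-occurrence first.
edges <- igraph::as_data_frame(g, what = "edges")
edges <- edges[order(-edges$weight), ]
edges
```

Because every pair of interests co-occurs at least once in the sample, the graph is complete; with so few nodes, any community structure comes entirely from the weights.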

You can apply any community detection algorithm to the object g, for instance:

spinglass.community(g,weights=E(g)$weight)
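To read off which interest ended up in which community, something along these lines should work. It is a sketch only: it uses `cluster_walktrap()` (the current name of the `walktrap.community()` mentioned in the question) as a stand-in for the spinglass call, and it rebuilds `g` from the hand-entered matrix so that it is self-contained.

```r
library(igraph)

nm <- c("comedy", "electronic", "gaming", "pop", "ent")
x <- matrix(
  c(264,  22,  738, 16,  398,
     22, 122,  163, 13,   11,
    738, 163, 2547, 79, 1225,
     16,  13,   79, 10,   29,
    398,  11, 1225, 29,  694),
  nrow = 5, byrow = TRUE, dimnames = list(nm, nm)
)
g <- graph_from_adjacency_matrix(x, mode = "undirected",
                                 weighted = TRUE, diag = FALSE)

# Run a community detection algorithm using the co-occurrence weights.
wc <- cluster_walktrap(g, weights = E(g)$weight)

# One community id per vertex, in vertex order.
memb <- as.integer(membership(wc))

# Interests grouped by the community they were assigned to.
split(V(g)$name, memb)

# Modularity of the partition that was found.
modularity(wc)
```

With only five nodes, different algorithms may well disagree, so it is worth comparing a couple of them before reading much into the result.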

If you want to cluster individuals rather than variables, I would look at PCA and hierarchical clustering (see for instance the excellent HCPC function from the FactoMineR package). You will use in that case the object matrix above (no need to compute the adjacency matrix).
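As a rough illustration of that route, here is a sketch that uses only base R (`prcomp()` and `hclust()`) rather than FactoMineR's HCPC. Converting the counts to row proportions and cutting the tree into two groups are both assumptions made for the example, not recommendations; the matrix is re-entered by hand under the name `m`.

```r
# Person-by-interest matrix, copied from the output above.
m <- matrix(
  c( 2, 11, 14, 1,  1,
    14,  0, 45, 1, 26,
     8,  0, 10, 0,  4,
     0,  1,  9, 2,  0,
     0,  0,  1, 2,  1,
     0,  0, 12, 0,  0),
  nrow = 6, byrow = TRUE,
  dimnames = list(as.character(1:6),
                  c("comedy", "electronic", "gaming", "pop", "ent"))
)

# Assumption: compare interest profiles rather than raw activity, so each
# row is rescaled to proportions (every row then sums to 1).
p <- m / rowSums(m)

# Principal components of the interest profiles.
pca <- prcomp(p)

# Hierarchical clustering of individuals on Euclidean distances.
hc <- hclust(dist(p), method = "ward.D2")

# k = 2 is an arbitrary choice for illustration.
groups <- cutree(hc, k = 2)
groups
```

`plot(hc)` draws the dendrogram, which is usually a better guide to the number of groups than a fixed `k`.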

Antoine
  • Could you explain what the numbers in the matrix `x`, which results from the cross-product, represent, please? – user20650 Jun 03 '15 at 21:09
  • @user20650 please check out the documentation of the `graph.adjacency` function: http://igraph.org/r/doc/graph.adjacency.html – Antoine Jun 04 '15 at 11:16
  • Thanks for your response. To expand on my question: for example, what does the `264` relate to in the `comedy / comedy` element of `x`? I can't see how this relates to the original data (I know I'm probably being slow). PS: I did understand that you were trying to form a weighted adjacency matrix. – user20650 Jun 04 '15 at 11:23
  • @user20650 the values themselves do not have any physical meaning; they are just weights that are meaningful in that they allow comparison. Also, the matrix is symmetric and we don't care about the diagonal. The values indicate the strength of the co-occurrence between variables. For instance, *electronic* and *gaming* jointly occur much more frequently (163) than *electronic* and *comedy* (22). When creating a weighted graph, these numbers would determine the 'thickness' of the edges. – Antoine Jun 04 '15 at 11:34
  • @user20650 note that I corrected a typo: `x=t(matrix)%*%matrix` instead of `x=t(x)%*%x` – Antoine Jun 04 '15 at 11:37