
I would like to create a matrix that indicates group membership from a data frame: an N x N matrix in which 1 means two neighborhoods are in the same city and 0 means they are in different cities (the diagonal is 0). For example:

hoodid <- c(1:10) 
cityid <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(hoodid, cityid)
df

#    hoodid cityid
# 1       1      1
# 2       2      1
# 3       3      1
# 4       4      2
# 5       5      2
# 6       6      3
# 7       7      3
# 8       8      3
# 9       9      3
# 10     10      3

The desired outcome is:

# 0 1 1 0 0 0 0 0 0 0
# 1 0 1 0 0 0 0 0 0 0
# 1 1 0 0 0 0 0 0 0 0 
# 0 0 0 0 1 0 0 0 0 0
# 0 0 0 1 0 0 0 0 0 0 
# 0 0 0 0 0 0 1 1 1 1
# 0 0 0 0 0 1 0 1 1 1 
# 0 0 0 0 0 1 1 0 1 1 
# 0 0 0 0 0 1 1 1 0 1 
# 0 0 0 0 0 1 1 1 1 0
bcrew
  • You might be interested in the igraph package, designed for this sort of thing – Frank May 03 '16 at 16:33
  • Can you give any more clue as to which part of igraph would help? – bcrew May 03 '16 at 16:39
  • `from_adjacency` will convert your adjacency matrix into a graph. From there, you can take advantage of the graph algorithms that are usually used for analysis of data like this. – Frank May 03 '16 at 16:45
  • I'll look into that too. Thanks for the help Frank. – bcrew May 03 '16 at 16:52
  • You might find [this post](http://stackoverflow.com/questions/19891278/r-table-of-interactions-case-with-pets-and-houses) helpful too; `tcrossprod(table(df))` – alexis_laz May 03 '16 at 18:14
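
The `tcrossprod(table(df))` suggestion in the last comment can be sketched roughly as follows. This is only an illustration using the question's data; it assumes each hoodid appears exactly once, and the name `membership` is made up here. Note that the rows come out in sorted hoodid order, and the diagonal has to be zeroed by hand to match the desired outcome.

```r
hoodid <- 1:10
cityid <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(hoodid, cityid)

# 10 x 3 indicator table: entry [i, k] is 1 if neighborhood i is in city k
# (assumes each hoodid occurs once; `membership` is an illustrative name)
membership <- unclass(table(df))

# product with its own transpose: entry [i, j] counts the cities shared by i and j
same_city <- tcrossprod(membership)

# a neighborhood is not paired with itself in the desired outcome
diag(same_city) <- 0
same_city
```

Unlike the block-diagonal approach, this does not need the rows to be sorted by cityid, but it builds a dense matrix.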

2 Answers


This works:

library(Matrix)
m = do.call(bdiag, lapply(
  lengths(split(df$cityid, df$cityid)), 
  function(n) 1 - diag(n)
))

# 10 x 10 sparse Matrix of class "dgCMatrix"
#                          
#  [1,] . 1 1 . . . . . . .
#  [2,] 1 . 1 . . . . . . .
#  [3,] 1 1 . . . . . . . .
#  [4,] . . . . 1 . . . . .
#  [5,] . . . 1 . . . . . .
#  [6,] . . . . . . 1 1 1 1
#  [7,] . . . . . 1 . 1 1 1
#  [8,] . . . . . 1 1 . 1 1
#  [9,] . . . . . 1 1 1 . 1
# [10,] . . . . . 1 1 1 1 .

This assumes that your data is sorted by cityid first and doesn't have duplicates or any other oddities.

You can call as.matrix(m) if you want a plain dense matrix.
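
If the data is not sorted by cityid, one possible workaround is to build the block-diagonal matrix in sorted order and then permute it back to the original row order. The sketch below uses a small made-up, unsorted example; `ord`, `inv`, and `m_sorted` are illustrative names, and it still assumes one row per neighborhood.

```r
library(Matrix)

# made-up example whose rows are NOT sorted by cityid
df <- data.frame(hoodid = 1:6, cityid = c(2, 1, 2, 3, 1, 3))

ord <- order(df$cityid)                # row order that sorts by cityid
sizes <- as.vector(table(df$cityid))   # group sizes, in sorted cityid order

# one block per city: ones everywhere except the diagonal
m_sorted <- bdiag(lapply(sizes, function(n) 1 - diag(n)))

# inverse permutation: position of each original row within the sorted order
inv <- order(ord)
m <- m_sorted[inv, inv]
m
```

Here `m[i, j]` refers to rows i and j of the original, unsorted data frame.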

Frank
  • Excellent. This is right on, assuming the sorting on cityid (the group variable) first. Thanks! – bcrew May 03 '16 at 16:36

I had a similar problem. Frank's solution worked for me, but I wanted a more general one, because his solution requires the group members to be ordered. Also, if you create a very large matrix (as I did), lapply leaves a lot of memory allocated that garbage collection (gc()) could not reclaim.

Required packages: igraph and data.table (data.table is not necessary, but it is faster).

library(igraph)
library(Matrix)
library(data.table)

hoodid <- c(1:10) 
cityid <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(hoodid, cityid)
df
#    hoodid cityid
# 1       1      1
# 2       2      1
# 3       3      1
# 4       4      2
# 5       5      2
# 6       6      3
# 7       7      3
# 8       8      3
# 9       9      3
# 10     10      3

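city_list = unique(df$cityid)
edges = vector("list", length(city_list))
for (i in seq_along(city_list)) {
   # all unordered pairs of neighborhoods in this city, one pair per row
   # (combn() needs at least two neighborhoods per city to behave as intended)
   edges[[i]] = data.table(t(combn(df[df$cityid == city_list[i], 'hoodid'], 2)))
}
edges = rbindlist(edges)

# build an undirected graph from the pairs, then read off its adjacency matrix
g = graph_from_edgelist(as.matrix(edges), directed = FALSE)
g = as_adjacency_matrix(g)  # current name for get.adjacency()
g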

# 10 x 10 sparse Matrix of class "dgCMatrix"
# 
# [1,] . 1 1 . . . . . . .
# [2,] 1 . 1 . . . . . . .
# [3,] 1 1 . . . . . . . .
# [4,] . . . . 1 . . . . .
# [5,] . . . 1 . . . . . .
# [6,] . . . . . . 1 1 1 1
# [7,] . . . . . 1 . 1 1 1
# [8,] . . . . . 1 1 . 1 1
# [9,] . . . . . 1 1 1 . 1
# [10,] . . . . . 1 1 1 1 .

Without data.table:

library(igraph)
library(Matrix)

hoodid <- c(1:10) 
cityid <- c(1, 1, 1, 2, 2, 3, 3, 3, 3, 3)
df <- data.frame(hoodid, cityid)
df
#    hoodid cityid
# 1       1      1
# 2       2      1
# 3       3      1
# 4       4      2
# 5       5      2
# 6       6      3
# 7       7      3
# 8       8      3
# 9       9      3
# 10     10      3

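edges = data.frame(matrix(ncol = 2, nrow = 0))
for (i in unique(df$cityid)) {
  # append all unordered pairs of neighborhoods in city i (needs at least two)
  edges = rbind(edges, t(combn(df[df$cityid == i, 'hoodid'], 2)))
}

# build an undirected graph from the pairs, then read off its adjacency matrix
g = graph_from_edgelist(as.matrix(edges), directed = FALSE)
g = as_adjacency_matrix(g)  # current name for get.adjacency()
g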

# 10 x 10 sparse Matrix of class "dgCMatrix"
# 
# [1,] . 1 1 . . . . . . .
# [2,] 1 . 1 . . . . . . .
# [3,] 1 1 . . . . . . . .
# [4,] . . . . 1 . . . . .
# [5,] . . . 1 . . . . . .
# [6,] . . . . . . 1 1 1 1
# [7,] . . . . . 1 . 1 1 1
# [8,] . . . . . 1 1 . 1 1
# [9,] . . . . . 1 1 1 . 1
# [10,] . . . . . 1 1 1 1 .
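
One caveat with the combn() approach: when a city has a single neighborhood, combn(x, 2) receives a length-one x and treats it as seq_len(x), so the pairs come out wrong (or it errors). A rough base-R sketch of a guard is below; `pairs_within` is a hypothetical helper, the data is made up, and it assumes hoodid runs from 1 to N so it can be used directly as a row index.

```r
# made-up example: city 2 has only one neighborhood
df <- data.frame(hoodid = 1:6, cityid = c(1, 1, 1, 2, 3, 3))

# hypothetical helper: all unordered pairs of ids, or zero rows if fewer than two ids
pairs_within <- function(ids) {
  if (length(ids) < 2) return(matrix(integer(0), ncol = 2))
  t(combn(ids, 2))
}

# stack the within-city pairs into a two-column edge list
edges <- do.call(rbind, lapply(split(df$hoodid, df$cityid), pairs_within))

# fill a dense adjacency matrix in both directions (assumes hoodid is 1..N)
adj <- matrix(0, nrow(df), nrow(df))
adj[edges] <- 1
adj[edges[, 2:1, drop = FALSE]] <- 1
adj
```

The row and column for neighborhood 4 stay all zero, which is what the desired outcome implies for a city with no other neighborhoods.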
Ali Furkan