-1

I have a datastructure which looks like below

 id  year   club
"A" "2010" "C1"
"B" "2010" "C1"
"C" "2010" "C2"
"A" "2011" "C1"
"B" "2011" "C2"

etc. My problem is to create a matrix on a year by year basis where the each unique individual is fixed to a row and each unique club is fixed to a column. If in a given year the individual visits the club the corresponding intersecting cell takes the value of 1 else 0.

Any help on this is much appreciated.

Thanks Anup

Anup
  • 31
  • 3
  • I'm a little confused by what you want here. Can you maybe provide a sample of how your output should look? I'm having trouble picturing how an "adjacency matrix" applies to this problem. – David Marx Jul 14 '13 at 15:49
  • Hi David. I'm actually having trouble trying to format the data. Maybe the term adjacency matrix is incorrect here. What I'm trying to do is set unique ids as rows and unique clubs as columns. If an id visits a club then the intersecting cell takes the value 1 or 0 otherwise. But I have to do this on a year by year basis. I hope this is clearer – Anup Jul 14 '13 at 15:52
  • When you say "year by year," do you want a separate matrix for each year, or do you want to count the number of years (or times) an id visits each club? – David Marx Jul 14 '13 at 15:53
  • No I need separate matrices for each year stacked one after the other. Therefore if we make two matrices for 2010 and 2011, then i guess I need to rbind(mat1, mat2). – Anup Jul 14 '13 at 15:55
  • Please give us an example of what you expect the solution to look like. You lack the language, which is ok, but use an alternative method to convey what you're after. – Tyler Rinker Jul 14 '13 at 16:06

3 Answers3

3

In base R, I would use by and xtabs:

by(dat, dat$year, with, as.matrix(xtabs(~ id + club) * 1L))

It returns a list of matrices (one per year). Also, I would recommend not multiplying by 1L to return matrices of booleans.


Edit1: as suggested, you can also create a contingency table quite easily:

table(dat[c("id", "club", "year")])

Edit2: I see your comment to Ricardo's answer, maybe this is what you are looking for:

library(plyr)
ddply(dat, .(year, id), with, 1L * (table(club) > 0))
#   year id C1 C2
# 1 2010  A  1  0
# 2 2010  B  1  0
# 3 2010  C  0  1
# 4 2011  A  1  0
# 5 2011  B  0  1

You might also want to use .drop = FALSE to get all (6) possible year/id combinations:

ddply(dat, .(year, id), with, 1L * (table(club) > 0), .drop = FALSE)
#   year id C1 C2
# 1 2010  A  1  0
# 2 2010  B  1  0
# 3 2010  C  0  1
# 4 2011  A  1  0
# 5 2011  B  0  1
# 6 2011  C  0  0
flodel
  • 87,577
  • 21
  • 185
  • 223
2

This one-line solution outputs a single dataframe instead of separate matrices like the other solutions thus far posted. It requires the reshape2 package, which if you've never used it before is extremely useful for this sort of thing.

require(reshape2)

df = data.frame(id=c('a','b','c','a','b')
               ,year=c(2010,2010,2010,2011,2011)
               ,club=c('c1','c1','c2','c1','c2')  
                )

result = dcast(df, year + id ~ club, fun.aggregate=length, drop=F)

> result

  year id c1 c2
1 2010  a  1  0
2 2010  b  1  0
3 2010  c  0  1
4 2011  a  1  0
5 2011  b  0  1
6 2011  c  0  0  # If you don't want this record, use: drop=T
David Marx
  • 8,172
  • 3
  • 45
  • 66
0

based on the comments, suitable formats are either a three-dimensional array or a list of data.frames / data.tables.

    library(data.table)
DT <- data.table

### create a template matrix
# find all unique ID's and unique Club's
ids <- unique(DT$id)
clubs <- unique(DT$club)
# matrix template based on the number of ids & clubs
mat.template <- matrix(0L, nrow=length(ids), ncol=length(clubs), dimnames=list(ids, clubs))

# iterate over the unique values of year
DT <- data.table(dat, key="id,club")
res <- lapply(unique(DT$year), function(y) {
    # index the matrix by the DT where y == year. Set those values to 1
        mat <- copy(mat.template)
        mat[as.matrix(DT[y==year, list(id, club)])] <- 1
        mat
    })

setattr(res, "names", unique(DT$year))

Results:

res     

$`2010`
  C1 C2
A  1  0
B  1  0
C  0  1

$`2011`
  C1 C2
A  1  0
B  0  1
C  0  0
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
  • Note: There's a function `CJ` for `data.table` that performs the job of `expand.grid`. – Arun Jul 14 '13 at 15:59
  • 1
    Hi Ricardo. Thanks for the solution but it looks I did not specify my problem correctly. I have now edited the problem statement. The expand.grid option is not giving me the dummy variable output that I was looking for. – Anup Jul 14 '13 at 16:01
  • thanks @Arun, I always forget about CJ – Ricardo Saporta Jul 14 '13 at 16:05
  • Hi Ricardo. Great it works just fine. One last question. Is it possible for me to get this into a four column matrix where Column 1 is the year, column 2 is the ID and columns 3 and 4 represent the dummy variables? – Anup Jul 14 '13 at 16:14
  • 1
    @Anup. First try not to change your mind regarding a question. But more important, if you do, make that updated request inside your question, not as a comment to someone's answer. – flodel Jul 14 '13 at 16:46
  • @ flodel: I agree. I apologize for the confusion in the question. Next time I will post a question with the expected output. I think that would have solved the problem. My bad. – Anup Jul 14 '13 at 16:53
  • @anup, the first answer I posted was more in line with what you were looking for. While flodel is correct, I would add that taking the time to think out your question will actually help clarify the problem for yourself as well as for others. This will save a lot of time for all involved :) Please don't let this deter you from asking more questions – Ricardo Saporta Jul 14 '13 at 17:02