2

I am attempting to transform columns to rows in R, without the use of reshape (can't install the package). The data I have received includes attributes and their corresponding metrics. I would like to calculate the statistical correlation between all of these attributes--16000 in total with 8 million records. Not all records have the same number of attributes.

To do this, I believe I will have to convert columns to rows so that I can eventually use the cor function e.g. cor(x[,1], x[,2:16000]). This may be wholly unnecessary if there is some way to use the cor function by attribute i.e. correlation between attribute 1 and 2, attribute 1 and 3, attribute 1 ... N. Any help would be much appreciated.

 ID          Attribute  Metric1 
 [1,]  1         1 -1.6363007
 [2,]  2         1  1.1483294
 [3,]  3         1  2.1682566
 [4,]  4         1 -1.1823649
 [5,]  5         1 -1.3631378
 [6,]  1         2 -1.1715544
 [7,]  2         2  1.5164278
 [8,]  3         2 -1.0110274
 [9,]  4         2 -0.9421652
[10,]  5         2 -0.2105443
[11,]  6         2 -0.4143548
[12,]  7         2 -1.6170975
[13,]  8         2  1.2402303
[14,]  9         2  0.4460047
[15,]  7         3  0.1060407
[16,]  8         3  0.9796893
[17,]  9         3  0.9254911
[18,] 10         3 -1.5728600
[19,] 11         3 -0.8082675
[20,] 12         3 -1.8643084

Transformation:

ID  attribute1  attribute2  attribute3
1   -1.6363007  -1.1715544  na
2   1.1483294   1.5164278   na
3   2.1682566   -1.0110274  na
4   -1.1823649  -0.9421652  na
5   -1.3631378  -0.2105443  na
6   na          -0.4143548  na
7   na          -1.6170975  0.1060407
8   na           1.2402303  0.9796893
9   na           0.4460047  0.9254911
10  na           na         -1.57286
11  na           na         -0.8082675
12  na           na         -1.8643084


test <- cbind(c(rep(1,5),rep(2,9),rep(3,6)), replicate(1,rnorm(20)))
test <- cbind(c(1:5,1:9,7:12),test)

@Aaron

q <- matrix(nrow=20,ncol=3)
colnames(q) <- c("x","y","z")
q[,3] <- replicate(1, rnorm(20))
q[,2] <- c(101,102,103,104,105,106, 107, 108, 101,103,107,109, 104,110,102,103,106,109,108,112)
q[15:20,1] <- 10000003
q[9:14,1] <- 10000002
q[1:8,1] <- 10000001
q <- data.frame(q)
q$x <- factor(q$x)
q$y <- factor(q$y)
q$z <- factor(q$z)

with(q, {
  out <- matrix(nrow=nlevels(x), ncol=nlevels(y),
                dimnames=list(levels(x), levels(y)))
  out[cbind(x, y)] <- z
  out
})
user1636475
  • 113
  • 2
  • 11

2 Answers2

5

No need for the "reshape" or "reshape2" packages. Just use base R reshape(). Assuming your data.frame is named "temp":

reshape(temp, direction = "wide", idvar="ID", timevar="Attribute")
#       ID Metric1.1  Metric1.2  Metric1.3
# [1,]   1 -1.636301 -1.1715544         NA
# [2,]   2  1.148329  1.5164278         NA
# [3,]   3  2.168257 -1.0110274         NA
# [4,]   4 -1.182365 -0.9421652         NA
# [5,]   5 -1.363138 -0.2105443         NA
# [11,]  6        NA -0.4143548         NA
# [12,]  7        NA -1.6170975  0.1060407
# [13,]  8        NA  1.2402303  0.9796893
# [14,]  9        NA  0.4460047  0.9254911
# [18,] 10        NA         NA -1.5728600
# [19,] 11        NA         NA -0.8082675
# [20,] 12        NA         NA -1.8643084

If your data are a matrix and not a data.frame, you will need to convert it to a data.frame before using reshape(), or you can use xtabs(). However, using xtabs() creates zeroes instead of NAs. Here's the xtabs() approach:

xtabs(Metric1 ~ ID + Attribute, tempm)
#     Attribute
# ID            1          2          3
#   1  -1.6363007 -1.1715544  0.0000000
#   2   1.1483294  1.5164278  0.0000000
#   3   2.1682566 -1.0110274  0.0000000
#   4  -1.1823649 -0.9421652  0.0000000
#   5  -1.3631378 -0.2105443  0.0000000
#   6   0.0000000 -0.4143548  0.0000000
#   7   0.0000000 -1.6170975  0.1060407
#   8   0.0000000  1.2402303  0.9796893
#   9   0.0000000  0.4460047  0.9254911
#   10  0.0000000  0.0000000 -1.5728600
#   11  0.0000000  0.0000000 -0.8082675
#   12  0.0000000  0.0000000 -1.8643084 
A5C1D2H2I1M1N2O1R2T1
  • 190,393
  • 28
  • 405
  • 485
  • When I try to implement this line I am prompted with the error message: "Error in match(rval[, idvar], thistime[, idvar]) : subscript out of bounds" – user1636475 Jan 07 '13 at 19:05
  • @user1636475, is your input a `data.frame` or a `matrix`? Try converting it to a `data.frame` before using the example code provided. – A5C1D2H2I1M1N2O1R2T1 Jan 07 '13 at 19:07
  • I thought that was the issue, but definitely a matrix. temp<- cbind(c(rep(1,5),rep(2,9),rep(3,6)), replicate(1,rnorm(20))) temp <- cbind(c(1:5,1:9,7:12),test) colnames(temp) <- c("ID", "Attribute", "Metric1") reshape(temp, direction = "wide", idvar="ID", timevar="Attribute") – user1636475 Jan 07 '13 at 19:12
  • @user1636475, as I mentioned, `reshape()` expects a `data.frame`, so try `reshape(data.frame(YOURMATRIXNAME, etc` instead. – A5C1D2H2I1M1N2O1R2T1 Jan 07 '13 at 19:14
  • 1
    @user1636475, you may want to also start accepting answers to some of your questions. I notice that you've asked four questions on Stack Overflow, but have not marked any of the answers in them as accepted (look for the checkmark next to the answers). Of course, do this only if the answers meet your expectations. – A5C1D2H2I1M1N2O1R2T1 Jan 07 '13 at 19:24
0

using reshape2

  acast(dat, ID~Attribute, value.var="Metric1")
           1          2          3
1  -1.636301 -1.1715544         NA
2   1.148329  1.5164278         NA
3   2.168257 -1.0110274         NA
4  -1.182365 -0.9421652         NA
5  -1.363138 -0.2105443         NA
6         NA -0.4143548         NA
7         NA -1.6170975  0.1060407
8         NA  1.2402303  0.9796893
9         NA  0.4460047  0.9254911
10        NA         NA -1.5728600
11        NA         NA -0.8082675
12        NA         NA -1.8643084
Germstorm
  • 9,709
  • 14
  • 67
  • 83
agstudy
  • 119,832
  • 17
  • 199
  • 261