0

Let's say I have a dataframe with three of its columns being

> df
A    B    C
1232    27.3    0.42
1232    27.3    0.36
1232    13.1    0.15
7564    13.1    0.09
7564    13.1    0.63

The required output is:

        [1232]    [7564]
[13.1]   0.15    0.36
[27.3]   0.39    0

I need to make a matrix with unique values in A and B as my rows and columns. The value for any cell in the matrix is to be calculated by subsetting the original dataframe for the particular value of A and B and calculating the mean of column C.

My code is:

mat <- matrix(rep(0), length(unique(df$A)), nrow = length(sort(unique(df$B))))
# sort is to avoid NA
colnames(mat) <- unique(df$A)
rownames(mat) <- unique(df$B)

for (row in rownames(mat)) {
    for (col in colnames(mat)) {
        x <- subset(df, A == col & B == row)
        mat[row, col] = mean(df$C)
     }
}

This is very slow, considering I have to deal with a matrix that has thousands of rows and columns. How can I make this run faster?

Sachin
  • 73
  • 1
  • 7
  • 1
    We need a reproducible example to solve your problem because as it is we could answer to a problem that is not what you are facing in real life. Also explicitly present what is the expected solution. – Seymour Jun 29 '18 at 07:21

3 Answers3

1

You can use a combination of aggregate() and xtabs():

df <- read.table(header=TRUE, stringsAsFactors = FALSE, text=
"A    B    C
1232    27.3    0.42
1232    27.3    0.36
1232    13.1    0.15
7564    13.1    0.09
7564    13.1    0.63")
xtabs(C ~ B + A, data=aggregate(C ~ B + A, data=df, FUN=mean))
# > xtabs(C ~ B + A, data=aggregate(C ~ B + A, data=df, FUN=mean))
#       A
# B      1232 7564
#   13.1 0.15 0.36
#   27.3 0.39 0.00

For other solutions read: How to reshape data from long to wide format?

jogo
  • 12,469
  • 11
  • 37
  • 42
1

Tidyverse solution:

library(tidyverse)

df %>% 
  group_by(A, B) %>%
  summarise(C = mean(C)) %>%
  spread(A, C)
Shinobi_Atobe
  • 1,793
  • 1
  • 18
  • 35
0

You probably want something like this: (using data.table)

n <- 1e3
v <- LETTERS[1:5]
set.seed(42)
df <- data.frame(A = sample(v, n, replace = T),
                 B = sample(v, n, replace = T),
                 C = sample.int(1e2, n, replace = T))


require(data.table)
dt <- as.data.table(df)
r <- dt[, .(v = mean(C)), keyby = .(A, B)] # calculate mean for each combination
r <- dcast(r, B ~ A, value.var = 'v') # transform to your structure
rmat <- as.matrix(r[, -1]) # to matrix
rownames(rmat) <- r[[1]] # add row names
rmat[1:5, 1:5]
#          A        B        C        D        E
# A 53.00000 42.71739 53.11538 49.35000 53.14286
# B 50.62745 58.41379 60.43590 48.75000 56.56410
# C 43.75000 42.93548 55.45000 52.63415 44.27907
# D 50.00000 49.84314 57.48276 50.37143 53.16667
# E 43.95122 55.46667 55.38095 43.85366 53.22222

P.S. your posted code was not correct. the loops should be:

for (row in rownames(mat)) {
  for (col in colnames(mat)) {
    x <- subset(df, A == col & B == row)
    mat[row, col] = mean(x$C)
  }
}

P.S.S. the loop could be optimized like:

for (row in rownames(mat)) {
  for (col in colnames(mat)) {
    i <- (df$A == col & df$B == row)
    mat[row, col] <- mean(df[i, 'C'])
  }
}
minem
  • 3,640
  • 2
  • 15
  • 29