This is an extension to an existing question: Convert table into matrix by column names

I am using the final answer: https://stackoverflow.com/a/2133898/1287275

The original CSV file has about 1.5M rows with three columns: row index, column index, and a value. All numbers are long integers. The underlying matrix is sparse, about 220K x 220K in size, with an average of about 7 values per row.

The original read.table works just fine.

  x <- read.table("/users/wallace/Hadoop_Local/reference/DiscoveryData6Mo.csv", header=TRUE);

My problem comes when I do the reshape command.

  reshape(x, idvar="page_id", timevar="reco", direction="wide")

The CPU hits 100% and there it sits forever. The machine (a Mac) has more memory than R is using. I don't see why it should take so long to construct a sparse matrix.

I am using the default matrix package. I haven't installed anything extra. I just downloaded R a few days ago, so I should have the latest version.

Suggestions?

Thanks, Wallace

Wallace
  • You should give `sparseMatrix` from the `Matrix` package a try. – flodel Mar 23 '12 at 01:45
  • The `reshape` function is not designed to construct a sparse matrix no matter what sacrifices you make to the _deus_ex_machina_. And there is no "matrix" package. If you are asking about the "Matrix" package, then please spell it correctly. – IRTFM Mar 23 '12 at 01:49
  • http://stackoverflow.com/a/9617424/210673 has a list of the various ways to do this. – Aaron left Stack Overflow Mar 23 '12 at 15:54

2 Answers

I would use the sparseMatrix function from the Matrix package. The typical usage is sparseMatrix(i, j, x), where i, j, and x are three vectors of the same length: respectively, the row indices, column indices, and values of the non-zero elements in the matrix. Here is an example where I have tried to match variable names and dimensions to your specifications:

num.pages <- 220000
num.recos <- 230000
N         <- 1500000

df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))
head(df)
#   page_id   reco     value
# 1   33688  48648 0.3141030
# 2   78750 188489 0.5591290
# 3  158870  13157 0.2249552
# 4   38492  56856 0.1664589
# 5   70338 138006 0.7575681
# 6  160827  68844 0.8375410

library("Matrix")
mat <- sparseMatrix(i = df$page_id,
                    j = df$reco,
                    x = df$value,
                    dims = c(num.pages, num.recos))
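
To make the behaviour of the call easier to check by eye, here is a minimal sketch on a toy table. The data are made up, and the column name `value` is an assumption, since the question does not name the third column of the CSV. One thing worth knowing: when the same (i, j) pair appears more than once, sparseMatrix adds the values together by default.

```r
library(Matrix)

# Toy triplet data standing in for the real CSV (values are made up;
# the column name `value` is hypothetical).
x <- data.frame(page_id = c(1, 2, 2, 4),
                reco    = c(3, 1, 1, 5),
                value   = c(10, 20, 5, 7))

# Build a 4 x 5 sparse matrix from the (row, column, value) triplets.
m <- sparseMatrix(i = x$page_id, j = x$reco, x = x$value, dims = c(4, 5))

m[1, 3]    # 10
m[2, 1]    # 25: the two (2, 1) entries are summed
nnzero(m)  # 3 non-zero cells are stored
```

If the real data can contain repeated (page_id, reco) pairs and summing is not what you want, it is probably safest to deduplicate or aggregate the data frame before building the matrix.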
flodel

The simplest way to do this in base R is with matrix indexing, like this:

# make up data
num.pages <- 100
num.recos <- 100
N <- 300
set.seed(5)
df <- data.frame(page_id = sample.int(num.pages, N, replace=TRUE),
                 reco    = sample.int(num.recos, N, replace=TRUE),
                 value   = runif(N))

# now get the desired matrix
out <- matrix(nrow=num.pages, ncol=num.recos)
out[cbind(df$page_id, df$reco)] <- df$value

However, in this case your resulting matrix would be 220k x 220k, and a dense matrix of that size would take far more memory than you have, so you need to use a package specifically for sparse matrices, as @flodel describes.
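
As a rough back-of-the-envelope check of that memory claim, the sketch below compares dense and sparse storage using the sizes given in the question. It assumes 8-byte doubles for values and 4-byte integers for indices, which is how a compressed column sparse matrix (dgCMatrix) from the Matrix package is laid out; actual object sizes will differ slightly because of object overhead.

```r
n   <- 220000    # rows and columns, from the question
nnz <- 1500000   # non-zero entries, from the question

# Dense storage: one 8-byte double per cell.
dense_gb <- n * n * 8 / 1e9
dense_gb    # 387.2

# Approximate compressed-column storage: an 8-byte value and a 4-byte
# row index per non-zero entry, plus (n + 1) 4-byte column pointers.
sparse_mb <- (nnz * (8 + 4) + (n + 1) * 4) / 1e6
sparse_mb   # about 18.9
```

So the dense version needs hundreds of gigabytes, while the sparse version fits in a few tens of megabytes.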

Aaron left Stack Overflow