
Suppose I am using this data set of movie ratings: http://www.grouplens.org/node/73

It contains ratings in a file formatted as userID::movieID::rating::timestamp

Given this, I want to construct a feature matrix in R, where each row corresponds to a user and each column holds the rating that the user gave to the corresponding movie (if any).

For example, if the data file contains

1::1::1::10
2::2::2::11
1::2::3::12
2::1::5::13
3::3::4::14

Then the output matrix would look like:

UserID, Movie1, Movie2, Movie3
1, 1, 3, NA
2, 5, 2, NA
3, NA, NA, 4

So, is there some built-in way to achieve this in R? I wrote a simple Python script to do the same thing, but I bet there are more efficient ways to accomplish this.

Iterator
Dan Q
  • reshape most likely would work, or something in the plyr package – aatrujillob Jan 17 '12 at 01:28
  • In addition to using sparse matrices, I'd recommend looking at other questions on R and sparse matrices, to get an idea of related issues: http://stackoverflow.com/questions/tagged/r+sparse-matrix – Iterator Jan 19 '12 at 13:01

3 Answers


You can use the dcast function, in the reshape2 package, but the resulting data.frame may be huge (and sparse).

d <- read.delim(
  "u1.base", 
  col.names = c("user", "film", "rating", "timestamp")
)
library(reshape2)
d <- dcast( d, user ~ film, value.var = "rating" )
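For instance, with the sample data from the question entered inline (so the snippet is self-contained), dcast produces exactly the user-by-movie layout asked for:

```r
library(reshape2)

# The five sample ratings from the question, as a data.frame
d <- data.frame(
  user   = c(1, 2, 1, 2, 3),
  film   = c(1, 2, 2, 1, 3),
  rating = c(1, 2, 3, 5, 4)
)

# One row per user, one column per film, NA where no rating exists
dcast(d, user ~ film, value.var = "rating")
#   user  1  2  3
# 1    1  1  3 NA
# 2    2  5  2 NA
# 3    3 NA NA  4
```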

If your fields are separated by double colons, you cannot use the sep argument of read.delim, which must be a single character. If you already do some preprocessing outside R, it is easiest to do the replacement there (e.g., in Perl, it would just be s/::/\t/g), but you can also do it in R: read the file as a single column, split the strings, and combine the results.

d <- read.delim("a", header = FALSE)  # one column, one line per record
d <- as.character( d[,1] )   # vector of strings
d <- strsplit( d, "::" )     # list of character vectors
d <- lapply( d, as.numeric ) # list of numeric vectors
d <- do.call( rbind, d )     # matrix
d <- as.data.frame( d )
colnames( d ) <- c( "user", "movie", "rating", "timestamp" )
Vincent Zoonekynd
  • this almost works! just one minor problem, the separators on my file are double colons "::" but it seems R complains about them. Is there a way around this or will I have to just perform a simple replacement in the file? – Dan Q Jan 17 '12 at 06:39
  • @DanQ: I have updated the answer to deal with your file format. – Vincent Zoonekynd Jan 17 '12 at 06:58
  • 1
    Since the matrix will likely be very sparse, have a look at the various packages for dealing with sparse matrices. – Has QUIT--Anony-Mousse Jan 17 '12 at 07:27
  • My main reason for constructing this matrix is to perform k-means clustering over the row vectors, using the R implementation. – Dan Q Jan 17 '12 at 16:03
  • I second @Anony-Mousse's recommendation. Dealing with sparse data any other way is not how it's usually done (i.e. it's naive to waste space representing sparse data in a dense matrix). – Iterator Jan 19 '12 at 12:47
  • 1
    However, AFAICT k-means in R only works with dense matrices. So you might need another k-means implementation, too. (But k-means is old crap anyway, and you might want to do spherical k-means or something here, too) – Has QUIT--Anony-Mousse Jan 19 '12 at 15:20
  • @Anony-Mousse Quite true - if one can't find a k-means that works with sparse matrices in R, then it should be written anew or just ignored. It is only worthwhile as an old method. In any case, one can store the sparse data "correctly" and then convert to a "full" matrix on demand. – Iterator Jan 20 '12 at 12:42

From the web site pointed to in a previous question, it appears that you want to represent

> print(object.size(integer(10000 * 72000)), units="Mb")
2746.6 Mb

which should be 'easy' with the 8 GB you reference in another question. Also, the total length is less than the maximum vector length in R, so that should be ok too. But see the end of the response for an important caveat!

I created, outside R, a tab-delimited version of the data file. I then read in the information I was interested in

what <- list(User=integer(), Film=integer(), Rating=numeric(), NULL)
x <- scan(fl, what)   # 'fl' is the path to the tab-delimited file

the 'NULL' drops the unused timestamp data. The 'User' and 'Film' entries are not sequential, and numeric() on my platform takes up twice as much memory as integer(), so I converted User and Film to factor, and Rating to integer() by doubling (original scores are 1 to 5 in increments of 1/2).

x <- list(User=factor(x$User), Film=factor(x$Film),
          Rating=as.integer(2 * x$Rating))
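The memory argument is easy to verify directly; on a 64-bit platform a double takes 8 bytes and an integer 4 (the exact printed sizes may differ slightly by platform):

```r
# One million doubles vs. one million integers
object.size(numeric(1e6))  # roughly 8 MB
object.size(integer(1e6))  # roughly 4 MB
```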

I then allocated the matrix

ratings <- matrix(NA_integer_,
                  nrow=length(levels(x$User)),
                  ncol=length(levels(x$Film)),
                  dimnames=list(levels(x$User), levels(x$Film)))

and use the fact that a two-column matrix can be used to index another matrix

ratings[cbind(x$User, x$Film)] <- x$Rating

This is the step where memory use is maximum. I'd then remove the unneeded variable

rm(x)

The gc() function tells me how much memory I've used...

> gc()
            used   (Mb) gc trigger   (Mb)  max used   (Mb)
Ncells    140609    7.6     407500   21.8    350000   18.7
Vcells 373177663 2847.2  450519582 3437.2 408329775 3115.4

... a little over 3 Gb, so that's good.
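As an aside, the two-column matrix indexing used to fill ratings above can be seen in miniature (the values here are invented for illustration; note that a factor used inside cbind is coerced to its integer level codes, which is what makes the real assignment work):

```r
# A 3x3 matrix of NAs, like the ratings matrix but tiny
m <- matrix(NA_integer_, nrow = 3, ncol = 3)

# Each row of the index matrix is one (row, column) pair
idx <- cbind(c(1, 2, 3), c(1, 3, 2))
m[idx] <- c(10L, 20L, 30L)   # one vectorized assignment, no loop

m   # m[1,1] == 10, m[2,3] == 20, m[3,2] == 30; the rest stay NA
```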

Having done that, you'll now run into serious problems. kmeans (from your response to an earlier answer) will not work with missing values

> m = matrix(rnorm(100), 5)
> m[1,1]=NA
> kmeans(m, 2)
Error in do_one(nmeth) : NA/NaN/Inf in foreign function call (arg 1)

and as a very rough rule of thumb I'd expect ready-made R solutions to require 3-5 times as much memory as the starting data size. Have you worked through your analysis with a smaller data set?

Martin Morgan
  • the dcast function allows one to choose a filler value (I chose 0) so that solves the missing values. However, you are right, I am concerned about how much memory the vectors will take up, as well as the necessary memory to execute k-means over them...I don't know if 8GB will be enough. (I was able to process the 1M data set from grouplens though) – Dan Q Jan 20 '12 at 09:00

Quite simply, you can represent it as a sparse matrix, using sparseMatrix from the Matrix package.

Just create a 3-column coordinate object list, i.e. in the form (i, j, value), say in a data.frame named myDF. Then, execute mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x, dims = c(numRows, numCols)) - you need to specify the number of rows and columns, or else the maximum indices will be used to decide the size of the matrix.

It's just that simple. Storing sparse data in a dense matrix is inappropriate, if not grotesque.
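Concretely, a minimal sketch using the sample data from the question (the data.frame contents and the 3x3 dims are illustrative assumptions):

```r
library(Matrix)

# Coordinate (triplet) form: one row per observed rating
myDF <- data.frame(
  i = c(1, 2, 1, 2, 3),   # user index
  j = c(1, 2, 2, 1, 3),   # movie index
  x = c(1, 2, 3, 5, 4)    # rating
)

# Only the 5 observed entries are stored, not all 9 cells
mySparseMat <- sparseMatrix(i = myDF$i, j = myDF$j, x = myDF$x,
                            dims = c(3, 3))
mySparseMat
```

One caveat: unstored entries in a sparse matrix are implicit zeros, not NA, so "no rating" becomes 0; that distinction matters for downstream algorithms such as the k-means discussed under the other answers.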

Iterator