
I am working on the MovieLens dataset. Every time I try to use reshape2 or tidyr to reshape it, it crashes my MacBook Pro. I wonder if it's because of the storage or the system.

It works fine when I only use the head of the dataset; the full dataset has a total of 2 million observations.

head(ratings)
  userId movieId rating  timestamp
1      1       2    3.5 1112486027
2      1      29    3.5 1112484676
3      1      32    3.5 1112484819
4      1      47    3.5 1112484727
5      1      50    3.5 1112484580
6      1     112    3.5 1094785740

# Create the ratings matrix: rows = userId, columns = movieId
library(reshape2)
rat_mat <- dcast(ratings, userId ~ movieId, value.var = "rating", na.rm = FALSE)

# Same attempt with tidyr
library(tidyr)
rat_mat <- spread(ratings, userId, rating)

Error message:

"NAs introduced by coercion to integer range
Error in dim.data.table(x) : long vectors not supported yet: ../../../../R-3.4.1/src/include/Rinlinedfuns.h:138"
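
A quick sanity check (assuming `ratings` is loaded as above) makes the error clearer: the wide userId-by-movieId result needs one cell per user/movie pair, and a base R vector can hold at most 2^31 - 1 elements before it becomes a "long vector", which `dim.data.table` does not support:

# How many cells would the wide matrix need? (assumes `ratings` is loaded)
n_users  <- length(unique(ratings$userId))
n_movies <- length(unique(ratings$movieId))
n_cells  <- as.numeric(n_users) * as.numeric(n_movies)

n_cells > .Machine$integer.max   # if TRUE, the result needs a "long vector" (the error above)
n_cells * 8 / 2^30               # rough size in GiB of a dense numeric matrix

This is also why the same code works on `head(ratings)` but not on the full data.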
  • That doesn't look like what I would call a "crash". It looks like an error. The error message appears to come from `data.table` code and suggests the size of the object may be large enough to require a vector of length more than 2^31-1. I think it's the size of the problem that is the culprit. – IRTFM Apr 07 '18 at 15:06
  • Are you intentionally trying to create a dataset with over 138000 columns? In the 20M version of `ratings` (from https://grouplens.org/datasets/movielens/), there are 138493 distinct `userId` values. If you really need this, then I suggest you divide and conquer: split the data up by (say) 1000 user ids at a time, spread each piece, and then either try to combine at the end (perhaps a mistake) or just work with them split ([list of dataframes](https://stackoverflow.com/questions/17499013/how-do-i-make-a-list-of-data-frames/24376207#24376207) might be useful); a rough sketch of this appears after these comments. – r2evans Apr 07 '18 at 16:43
  • (And I concur with @42- ... and it will also be a memory-starvation problem (`Error: cannot allocate vector of size 20634.9 Gb`) unless you have more than I do (16GB).) – r2evans Apr 07 '18 at 16:46
  • The data is massive; I will subset the data for the modeling, thanks. – Albert TSAI Apr 08 '18 at 06:03
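
A minimal sketch of the divide-and-conquer approach r2evans suggests above, assuming `ratings` is loaded as in the question; the chunk size of 1000 user ids and the choice to keep the pieces as a list of data frames are illustrative assumptions, not a fixed recipe:

library(reshape2)

# Split the user ids into chunks of 1000 (chunk size is an arbitrary choice)
ids    <- sort(unique(ratings$userId))
chunks <- split(ids, ceiling(seq_along(ids) / 1000))

# Cast each chunk to a small userId x movieId data frame and keep the results in a list
rat_list <- lapply(chunks, function(u) {
  dcast(ratings[ratings$userId %in% u, ],
        userId ~ movieId, value.var = "rating")
})

Note that each piece only gets columns for the movies its users actually rated, so the column sets differ across chunks; that would need to be reconciled if the pieces were ever combined.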

0 Answers