11

I'm trying to run some R code and it is crashing because of memory. The error that I get is:

Error in sendMaster(try(lapply(X = S, FUN = FUN, ...), silent = TRUE)) : 
  long vectors not supported yet: memory.c:3100

The function that creates the troubles is the following:

StationUserX <- function(userNDX){
  lat1 = deg2rad(geolocation$latitude[userNDX])
  long1 = deg2rad(geolocation$longitude[userNDX])
  session_user_id = as.character(geolocation$session_user_id[userNDX])
  #Find closest station
  Distance2Stations <- unlist(lapply(stationNDXs, Distance2StationX, lat1, long1))
  # Return index for closest station and distance to closest station
  stations_userX = data.frame(session_user_id = session_user_id, 
                              station = ghcndstations$ID[stationNDXs], 
                              Distance2Station = Distance2Stations)    
  stations_userX = stations_userX[with(stations_userX, order(Distance2Station)), ]
  stations_userX = stations_userX[1:100,] #only the 100 closest stations...
  row.names(stations_userX)<-NULL
  return(stations_userX)
}

I run this function 50k times using mclapply. Each call to StationUserX in turn calls Distance2StationX 90k times.

Is there an obvious way to optimize the function StationUserX?

Ignacio

2 Answers

14

mclapply is having trouble sending all the data back from the worker processes to the main process. That's because of prescheduling: it assigns a large number of iterations to each worker and then sends each worker's results back in one go. That's nice and fast, but here it results in more than 2 GB of data being sent back at once, which it can't do.

Run mclapply with mc.preschedule = FALSE to turn off prescheduling. Each iteration is then run in its own forked process and returns its own data. It won't go quite as fast, but it gets around the problem.
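As a rough sketch of what that call could look like: the example below uses a made-up stand-in function (toy_station_user, not the real StationUserX) and an assumed core count of 2, so treat the names and sizes as placeholders rather than as the asker's actual setup.

```r
library(parallel)

# Hypothetical stand-in for StationUserX: returns a small data frame per user.
toy_station_user <- function(user_ndx) {
  data.frame(session_user_id = as.character(user_ndx),
             station = sprintf("ST%03d", 1:3),
             Distance2Station = c(1.5, 2.5, 3.5) * user_ndx,
             stringsAsFactors = FALSE)
}

# Forking is not available on Windows, where mc.cores has to be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# With mc.preschedule = FALSE each iteration returns its own (small) result
# instead of one large batch per worker.
results <- mclapply(1:10, toy_station_user,
                    mc.cores = n_cores,
                    mc.preschedule = FALSE)

# Combine the per-user data frames in the main process.
all_users <- do.call(rbind, results)
```

The combined result still has to fit in the main process's memory; the option only changes how much is sent back per transfer.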

Stan
-1

Try using nextElem() from the iterators package. It acts like a "generator" in Python, so you don't have to load the entire list into memory.

rsoren
  • I'm looking at the man page for nextElem but I don't really get how i should modify my function to use it. Can you show me? Thanks! – Ignacio Apr 22 '14 at 22:28
  • Can you tell which object is the "long vector" that is causing problems? The idea is that you'll use ```nextElem()``` to pass the elements of the vector one-by-one, rather than passing the entire vector in one go. – rsoren Apr 22 '14 at 22:47
  • The error message does not really say. I'm calling StationUserX with mclapply and passing StationUserX. StationUserX is a vector of 50k observations. The object that the function produces, stations_userX, is a really big object. I have 100 stations times 50k users, so the output is going to have 5 million rows and 3 columns. – Ignacio Apr 22 '14 at 22:57
  • Ah I see. ```mclapply``` is definitely having trouble giving you back the *output* because it's too large. ```nextElem()``` isn't going to help with that. Another alternative is to find out what ```mclapply``` thinks is a "long vector" and work in chunks that are smaller than that. Sorry. – rsoren Apr 22 '14 at 23:14
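
The "work in chunks" idea from the last comment might look something like the sketch below. It is only an illustration under assumptions: toy_station_user is a hypothetical stand-in for StationUserX, and the chunk size of 10 and core count of 2 are arbitrary; a real chunk size would need to be chosen so that each batch of results stays well under the size mclapply can return.

```r
library(parallel)

# Hypothetical stand-in for StationUserX: returns a small data frame per user.
toy_station_user <- function(user_ndx) {
  data.frame(session_user_id = as.character(user_ndx),
             station = sprintf("ST%03d", 1:3),
             Distance2Station = c(1.5, 2.5, 3.5) * user_ndx,
             stringsAsFactors = FALSE)
}

user_ndxs  <- 1:25
chunk_size <- 10  # assumed value, for illustration only

# Split the user indices into consecutive chunks of at most chunk_size.
chunks <- split(user_ndxs, ceiling(seq_along(user_ndxs) / chunk_size))

# Forking is not available on Windows, where mc.cores has to be 1.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# Run mclapply once per chunk so each call only returns one chunk's results.
chunk_results <- lapply(chunks, function(ndxs) {
  per_user <- mclapply(ndxs, toy_station_user, mc.cores = n_cores)
  do.call(rbind, per_user)
})

all_users <- do.call(rbind, chunk_results)
```

If even the combined result is too large to hold comfortably, each chunk could be written to disk inside the loop instead of being kept in chunk_results.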