
I have a relatively short script that takes a large data frame (2,373,142 rows x 21 columns) of numeric and string fields and breaks it into a list of data frames based on the values of one of the columns. With this dataset the list ends up being 92 elements long, and it is then run through a function from the PhysicalActivity package using lapply. The script works perfectly on smaller datasets, but on one this large it maxes out the memory. I even tried breaking it up and running smaller and smaller lists, but it maxes out even with just a two-item subset of the original list. I should add that my computer has 16 GB of RAM, all of which R has access to.

I'm at a loss as to how to make it more efficient, since I'm not using any explicit loops, but I was hoping someone more R-savvy than I am might have some suggestions. I'm worried that it's the wearingMarking function itself that's causing the trouble, but I'm not sure. My data is sensitive, so unfortunately I can't provide a sample. My apologies, as I know that is far from ideal and restrictive, but any help would be greatly appreciated.

library(PhysicalActivity) # provides wearingMarking()
library(plyr)             # provides ldply()

allData <- read.csv("myData.csv", header = TRUE) # Loading data

chngActivity <- allData[, "activity"] # Creating a duplicate of the activity intensity column
chngActivity[chngActivity == -2] <- 0
allData <- cbind(allData, chngActivity) # Binding the new column to the old df

corTime <- transform(allData, dateTime = strptime(allData$dateTime, "%m/%d/%y %H:%M")) # Making sure dateTime is parsed as a date-time
corTimeLst <- split(corTime, corTime$identifier) # Splitting into a list of dfs by identifier
rm(allData, corTime)

allChoi <- function(f) {
  choi_test <- wearingMarking(dataset = f,       # Running the Choi algorithm.
                              frame = 90,        # The current parameters are set to
                              perMinuteCts = 1,  # a one-minute epoch, with the new
                              TS = "dateTime",   # non-wear column called "wearing".
                              cts = "chngActivity",
                              streamFrame = NULL,
                              allowanceFrame = 3,
                              newcolname = "wearing")
  return(choi_test)
}


choiRun <- lapply(corTimeLst, allChoi) # Applying the function to each participant in the list
choiFlat <- ldply(choiRun, data.frame) # Flattening the list into a df
  • what is twoList? why are you writing over choiRun? – Andrew Cassidy Jan 29 '14 at 18:43
  • Whoops, sorry, that was me testing a smaller subset and I forgot to take it out. Just made the correction. – Misc Jan 29 '14 at 18:45
  • possible duplicate of [Trimming a huge (3.5 GB) csv file to read into R](http://stackoverflow.com/questions/3094866/trimming-a-huge-3-5-gb-csv-file-to-read-into-r) – krlmlr Jan 29 '14 at 19:59
  • Look especially for `ff` and `bigmemory`, and http://cran.r-project.org/web/views/HighPerformanceComputing.html (as seen in http://stackoverflow.com/q/11055502/946850). – krlmlr Jan 29 '14 at 20:01
  • Ok, thanks @krlmlr. I've used ff before with dataframes. I'll try to use it with my lists. – Misc Jan 29 '14 at 20:28
  • @Misc: I'd try keeping the entire data as an `ffdf` and then run `ddply` on it. – krlmlr Jan 29 '14 at 20:38
  • But before starting heavy machinery, get rid of `corTime`. Using a second variable for the big data frame increases memory requirements. Why don't you simply write `allData$dateTime <- ...`, same for the `cbind` call above? Remember that a data frame internally is a list of column vectors. -- Also, check out the `lsos()` function. – krlmlr Jan 29 '14 at 20:43
  • Great, thank you. I'll try the streamlining suggestions first and then move to ff if necessary. – Misc Jan 29 '14 at 21:36
  • @krlmlr thanks so much, the streamlining worked and I didn't even need ff! – Misc Jan 30 '14 at 17:03
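
For reference, a minimal sketch of the in-place streamlining suggested in the comments above (assuming the same column names as the original script; this is not the asker's exact final code). Instead of building chngActivity separately and cbind-ing it back, and instead of creating corTime as a second full copy of the data frame via transform, the columns are modified directly on allData, so the large data frame is never held in memory twice:

# Sketch of the in-place approach from the comments (column names assumed).
allData <- read.csv("myData.csv", header = TRUE)

# Add the recoded counts as a new column of allData rather than cbind-ing a copy
allData$chngActivity <- allData$activity
allData$chngActivity[allData$chngActivity == -2] <- 0

# Parse the timestamp in place instead of creating a second data frame
allData$dateTime <- strptime(allData$dateTime, "%m/%d/%y %H:%M")

corTimeLst <- split(allData, allData$identifier)
rm(allData); gc() # drop the unsplit copy before running wearingMarking

The resulting corTimeLst can then be passed to lapply(corTimeLst, allChoi) exactly as before; the only difference is the reduced peak memory use.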
