
I'm a bit stuck after thinking and reading about this.

  • I have a dataframe with about 8x10^6 rows
  • and roughly 40 categories that I'm interested in
  • I'm trying to do two things (apologies for posting them together, but they seem highly related)
  • First, I'm looking for an efficient way to randomly sample 100 rows from each category, i.e. each value of var1 (which goes from 01 to 40)
  • Ideally, I'd create a new dataframe with about 4,000 rows (40 categories x 100 rows) instead of 8 million
  • Second, I'd like to take the average of the var2 and var3 values per var1 (that is, within each category)

Perhaps these are related in terms of methods.

My dataframe looks something like this (an oversimplification)

              var1     var2     var3     var4
1             01       949.47   ..       ..
2             01       935.09   ..       ..
3             01       935.01   ..       ..
4             01       355.39   ..       ..
5             01       455.07   ..       ..
6             01       525.08   ..       ..
..
250000        02       485.82   ..       ..
250001        02       204.14   ..       ..
250002        02       388.22   ..       ..
..

I've tried splitting the dataframe in a for-loop, but this doesn't succeed (never ends, and I need to kill the process).

for (i in 1:8000000){
   out <- split(dat, f = dat$var1)
}

Also, I'm not sure what to do next: how to manage all the separate dataframes, and whether this is the best method.
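
For reference, here is a minimal base R sketch of both steps, under some assumptions. The toy `dat` below only stands in for the real data (the row counts and value ranges are made up), and `n_per_cat`, `idx_by_cat`, `picked`, `sampled` and `means` are illustrative names. The key point is that `split()` already works on the whole column at once, so it is called a single time rather than inside a loop over rows.

```r
# Toy stand-in for the real data: 40 categories, 500 rows each (assumed shape).
set.seed(1)
dat <- data.frame(
  var1 = rep(sprintf("%02d", 1:40), each = 500),
  var2 = runif(40 * 500, min = 200, max = 950),
  var3 = rnorm(40 * 500)
)

# Call split() once. Splitting row indices (not the data frame itself)
# avoids copying the data into 40 separate data frames.
idx_by_cat <- split(seq_len(nrow(dat)), dat$var1)

# Draw up to 100 row indices per category, without replacement.
n_per_cat <- 100
picked <- lapply(idx_by_cat, function(i) {
  i[sample.int(length(i), min(n_per_cat, length(i)))]
})

# One data frame with the sampled rows from every category.
sampled <- dat[unlist(picked, use.names = FALSE), ]

# Mean of var2 and var3 within each category of var1.
means <- aggregate(cbind(var2, var3) ~ var1, data = dat, FUN = mean)
```

Note that the formula interface of `aggregate()` drops rows with missing values by default, so the means may need extra care if var2 or var3 contain NAs.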

Many thanks for any tips!

nick88
  • Thanks! I tried, but I get an error: unused argument (by = var1) – nick88 Nov 25 '18 at 20:39
  • Sounds like you are trying to use `data.table` functions on a `data.frame`. Use `dt <- as.data.table(df)` to make a second data set which is a `data.table`, or use `setDT(df)` to update your data frame by reference. – Henrik Nov 25 '18 at 22:08
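
A rough sketch of what that could look like, assuming the `data.table` package is installed. The toy `df` is made-up data, and `sampled` and `means` are illustrative names. The `by = var1` argument is only understood once the object is a `data.table`, which is probably why it was reported as unused on a plain `data.frame`.

```r
library(data.table)

# Toy stand-in for the real data: 40 categories, 500 rows each (assumed shape).
set.seed(1)
df <- data.frame(
  var1 = rep(sprintf("%02d", 1:40), each = 500),
  var2 = runif(40 * 500, min = 200, max = 950),
  var3 = rnorm(40 * 500)
)

# Convert to a data.table copy; setDT(df) would convert in place instead.
dt <- as.data.table(df)

# Sample up to 100 rows within each category of var1.
# .N is the group size and .SD is the subset of rows for the current group.
sampled <- dt[, .SD[sample(.N, min(100L, .N))], by = var1]

# Mean of var2 and var3 per category.
means <- dt[, lapply(.SD, mean), by = var1, .SDcols = c("var2", "var3")]
```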
