
I am trying to sample a data frame that contains two columns: one is ID and one is count. The sum of count is 10^13, and I want to sample it down to totals of 10^12, 10^11, and so on, but

the vector exceeds the limits of R.

How can I sample this kind of data?

df_random[[i]] <- df2 %>%
  mutate(ID = factor(ID)) %>%
  tidyr::uncount(count) %>%
  sample_n(nrow(.)) %>%
  sample_n(size = round(n / fold2), replace = TRUE) %>%
  count(ID, name = "value", .drop = FALSE)
mel099
    Do you get an error, or do you just run out of memory and crash? This question might be helpful: https://stackoverflow.com/q/34165654/8366499, or this one: https://stackoverflow.com/q/21528752/8366499. Take a look at the `bigmemory` package – divibisan Aug 07 '23 at 16:50
  • Thank you, I am looking into it. The error is "vector memory exhausted (limit reached?)", which comes from the uncount part since the vector size exceeds 2^31-1. – mel099 Aug 07 '23 at 16:55
    A vector from 1 to 10^13 would already take up 80 TB, so creating the whole potential data range is very inefficient if not impossible for most computers. I presume `dplyr::slice_sample(weight_by = count)` would be a better way to go here. – Jon Spring Aug 07 '23 at 17:17
    How many rows are there actually in `df2`? You say that the sum of that column is 10^13 but I assume there are at least some counts > 1. – Dubukay Aug 07 '23 at 17:17
  • There are 300,000 rows in the data. – mel099 Aug 07 '23 at 19:21
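
Following up on the `slice_sample(weight_by = count)` idea from the comments: because the posted code ends with `replace = TRUE`, drawing 10^12 rows with replacement (weighted by count) and re-counting them per ID is equivalent to a single multinomial draw over the 300,000 IDs, which never needs the uncounted vector. Below is a minimal sketch of that approach; `sample_counts`, `chunk`, and `target` are names and values I made up, and the chunked loop is only there because `rmultinom()`'s `size` argument must stay below 2^31 - 1.

```r
library(dplyr)

# Sketch: draw a weighted sample of `target` events without uncounting.
# A multinomial draw with probabilities proportional to `count` is the
# same as sampling `target` rows with replacement (weighted by count)
# and tallying them per ID, but it never materializes the huge vector.
sample_counts <- function(counts, target, chunk = 2e9) {
  total <- numeric(length(counts))   # accumulate as doubles to avoid integer overflow
  remaining <- target
  while (remaining > 0) {
    draw <- min(chunk, remaining)    # keep each rmultinom() size below 2^31 - 1
    total <- total + rmultinom(1, size = draw, prob = counts)[, 1]
    remaining <- remaining - draw
  }
  total
}

df_sampled <- df2 %>%
  mutate(value = sample_counts(count, target = 1e12))
```

Note that this only reproduces the with-replacement behaviour of the posted code; sampling without replacement from the counts would need a different (e.g. multivariate hypergeometric) draw.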

0 Answers