I'd like to convert a data frame to a disk.frame and then count the unique values in its first column. When I try, the result is not the number of unique values; it appears to be the number of workers instead.

    library(disk.frame)
    options(future.globals.maxSize = Inf)
    setup_disk.frame(workers = 8)

This is an example dataset

    bigint <- sample(123901239804:901283455390, 3*10^5)
    df <- data.frame(bigint)
    df %>% 
      summarize(ints = length(unique(bigint)))
    
    df %>% 
      as.disk.frame %>%
      summarize(ints = length(bigint)) %>% 
      collect

In the first query, it gets me this output

        ints
    1 300000

In the second query, it gets me this output

        ints
    1      8

Cauder

1 Answer

{disk.frame} only supports some group-by functions. You can use dplyr::n_distinct instead:

    df %>% 
      as.disk.frame %>%
      summarize(ints = n_distinct(bigint)) %>% 
      collect

which yields

        ints
    1 300000

See the list of supported group-by verbs here https://diskframe.com/articles/10-group-by.html#list-of-supported-group-by-functions

You can define more customised group-by verbs by following this guide

https://diskframe.com/articles/11-custom-group-by.html

Certain group-by verbs cannot be computed exactly (and have to rely on estimates) because of the chunked nature of disk.frame. But that is true of all "big" data systems.
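
To illustrate why chunking matters, here is a minimal sketch in plain R that mimics a two-stage (per-chunk, then combined) aggregation without using disk.frame at all. The chunk count and variable names are hypothetical, and this is not how disk.frame is implemented internally; it only shows that a distinct count cannot be obtained by adding up per-chunk counts, and that a per-chunk summary produces one result per chunk.

```r
# Simulate a chunked dataset in plain R (no disk.frame required).
set.seed(42)
x <- sample(1:1000, 5000, replace = TRUE)  # values with many repeats
n_chunks <- 8                              # hypothetical number of chunks
chunks <- split(x, rep_len(seq_len(n_chunks), length(x)))

# Stage 1: aggregate every chunk independently.
per_chunk_counts  <- vapply(chunks, function(ch) length(unique(ch)), integer(1))
per_chunk_uniques <- lapply(chunks, unique)

# Stage 2: combine the per-chunk results.
# Summing the counts over-counts values that occur in more than one chunk.
naive_total <- sum(per_chunk_counts)
# Combining the unique values first gives the same answer as in memory.
exact_total <- length(unique(unlist(per_chunk_uniques, use.names = FALSE)))
# Stage 1 yields one result per chunk.
n_results <- length(per_chunk_counts)

stopifnot(exact_total == length(unique(x)))
```

Here `n_results` equals the number of chunks, which may be what the 8 in the question reflects, although that is an assumption about how the unsupported summary was evaluated. A supported function such as `n_distinct` takes care of the combining step for you.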

xiaodai
  • Thanks! I'm going to use disk frame for this problem https://stackoverflow.com/questions/63800688/choosing-the-right-approach-for-a-50gb-file – Cauder Sep 09 '20 at 01:06
  • Is it possible to calculate a given percentile for a column? – Cauder Sep 09 '20 at 01:07
  • @Cauder it's in the list of group-by verbs :) But it's not exact; it's an approximation. – xiaodai Sep 09 '20 at 01:07
  • Thank you! Are there any articles comparing disk.frame and data table? I'd love to learn more about when I should be using one approach against the other approach – Cauder Sep 09 '20 at 01:08
  • @Cauder read the common questions. data.table is pure in memory, while disk.frame is on disk. So disk.frame is slower but can handle any data that fits on your disk. See https://diskframe.com/#common-questions – xiaodai Sep 09 '20 at 01:13
  • That's awesome. To put it back to you, disk frame can work with larger datasets because there's more resources on disk than in memory. But, if something fits in memory, then maybe data table is an acceptable solution – Cauder Sep 09 '20 at 01:14
  • disk.frame() looks sweet. Can I pass it a folder of CSV files? – Cauder Sep 09 '20 at 01:16
  • @Cauder yes. See https://diskframe.com/articles/04-ingesting-data.html#multiple-csv-files Better ask questions here https://github.com/xiaodaigh/disk.frame/issues – xiaodai Sep 09 '20 at 01:18