
I have a data set with 80+ million rows. Because of a memory shortage I can't manipulate this data properly and get error messages like "cannot allocate vector of size 180 MB". I found the disk.frame package, which lets you manipulate data without loading it all into RAM. Sounds amazing. The example code works fine, but mine doesn't. I do the following:

library(disk.frame)
library(dplyr)

setup_disk.frame()
Test_DF <- csv_to_disk.frame("test results.csv")

Everything looks good, and when I check it with `head(Test_DF)` I see my data. The problem starts when I try to manipulate the data:

Test_DF <- Test_DF %>%
  srckeep("Cust(Child)-Entity", "ParentCustName", "Item-Entity", "SBLOC", "Pred_Entity_Loc", "SBPHYP") %>%
  group_by(`Cust(Child)-Entity`, ParentCustName, `Item-Entity`, SBLOC, Pred_Entity_Loc, SBPHYP) %>% 
  summarise("Historical_Sales" = sum(OldSales, na.rm = TRUE), 
            "Historical_COGS" = sum(OldCost, na.rm = TRUE), 
            "Historical_Net_COGS" = sum(OldCost_Net, na.rm=TRUE), 
            "Historical_Qty" = sum(Qty, na.rm=TRUE)) %>%
  collect
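
In case it is relevant: `srckeep()` only keeps the columns that are listed, and the columns used inside `summarise()` (OldSales, OldCost, OldCost_Net, Qty) are not on that list. The examples I have seen also pass `srckeep()` a single character vector rather than separate strings. A variant of that call with both changes would look like this (a sketch, not yet verified):

# keep the grouping columns AND the columns summarise() needs,
# passed as one character vector
srckeep(c("Cust(Child)-Entity", "ParentCustName", "Item-Entity",
          "SBLOC", "Pred_Entity_Loc", "SBPHYP",
          "OldSales", "OldCost", "OldCost_Net", "Qty"))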

The error message I get from this pipeline is the following:

Error in parse(text = paste0(func_call_str, "_df.chunk_agg.disk.frame")) : 
  <text>:1:2: unexpected input
1: -_
     ^
In addition: Warning message:
In collect.summarized_disk.frame(.) :
  These columns that appear in the group-by and summarise does not appear in the original data set: sum, -, Historical_Qty. This set of action is too hard for disk.frame to figure out the `srckeep` automatically, you must do the `srckeep` manually.
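
From the parse error (`-_` is not parseable R code), my guess is that the special characters in the column names (the parentheses and hyphens in `Cust(Child)-Entity` and `Item-Entity`) trip up the aggregation code disk.frame generates for each chunk. A possible workaround would be to rename those columns to syntactic names before grouping; a sketch of the full pipeline under that assumption (untested):

library(disk.frame)
library(dplyr)

Summary_DF <- Test_DF %>%
  # keep everything the query touches, as one character vector
  srckeep(c("Cust(Child)-Entity", "ParentCustName", "Item-Entity",
            "SBLOC", "Pred_Entity_Loc", "SBPHYP",
            "OldSales", "OldCost", "OldCost_Net", "Qty")) %>%
  # give the non-syntactic columns plain names before disk.frame
  # has to generate code that refers to them
  rename(Cust_Child_Entity = `Cust(Child)-Entity`,
         Item_Entity = `Item-Entity`) %>%
  group_by(Cust_Child_Entity, ParentCustName, Item_Entity,
           SBLOC, Pred_Entity_Loc, SBPHYP) %>%
  summarise(Historical_Sales    = sum(OldSales, na.rm = TRUE),
            Historical_COGS     = sum(OldCost, na.rm = TRUE),
            Historical_Net_COGS = sum(OldCost_Net, na.rm = TRUE),
            Historical_Qty      = sum(Qty, na.rm = TRUE)) %>%
  collect()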

Please advise

    AFAIK, the data is only read into R when specifically requested. It sounds like your csv is malformed at some point in the file that you do not hit with `head(Test_DF)` but only encounter with the longer command. – JBGruber Oct 12 '22 at 13:51
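
If the malformed-CSV hunch is right, one cheap check is to count the fields on every line of the file. A minimal sketch using base R's `count.fields()`, which reads the file line by line rather than loading it into RAM (slow on 80M rows, but it completes):

# lines whose field count differs from the rest point at malformed rows;
# count.fields() returns NA for lines inside multi-line quoted fields
n_fields <- count.fields("test results.csv", sep = ",")
table(n_fields)  # a single value means every row has the same field count
head(which(n_fields != median(n_fields, na.rm = TRUE)))  # suspect line numbers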

0 Answers