I have a data set with 80+ million rows. Because of a memory shortage I can't manipulate this data properly and get error messages like "cannot allocate vector of size 180 Mb". I found the disk.frame library, which helps to manipulate data without loading it all into RAM. Sounds amazing. The package's example code works well, but mine doesn't. I do the following:
setup_disk.frame()
Test_DF <- csv_to_disk.frame("test results.csv")
Everything looks good, and when I check it with

head(Test_DF)

I see my data. The problem starts when I try to manipulate the data:
Test_DF <- Test_DF %>%
  srckeep("Cust(Child)-Entity", "ParentCustName", "Item-Entity", "SBLOC", "Pred_Entity_Loc", "SBPHYP") %>%
  group_by(`Cust(Child)-Entity`, ParentCustName, `Item-Entity`, SBLOC, Pred_Entity_Loc, SBPHYP) %>%
  summarise("Historical_Sales" = sum(OldSales, na.rm = TRUE),
            "Historical_COGS" = sum(OldCost, na.rm = TRUE),
            "Historical_Net_COGS" = sum(OldCost_Net, na.rm = TRUE),
            "Historical_Qty" = sum(Qty, na.rm = TRUE)) %>%
  collect
The error message is the following:
Error in parse(text = paste0(func_call_str, "_df.chunk_agg.disk.frame")) :
<text>:1:2: unexpected input
1: -_
^
In addition: Warning message:
In collect.summarized_disk.frame(.) :
These columns that appear in the group-by and summarise does not appear in the original data set: sum, -, Historical_Qty. This set of action is too hard for disk.frame to figure out the `srckeep` automatically, you must do the `srckeep` manually.
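For reference, here is a small self-contained example with made-up data. The column names here are hypothetical, but they mimic mine (including the parentheses and hyphen), which I suspect is what trips up disk.frame's parsing. I believe this reproduces the same error:

library(disk.frame)
library(dplyr)

setup_disk.frame()

# Toy data with a non-syntactic column name like my real ones
toy <- data.frame(x = c("C1", "C1", "C2"), y = c(10, 20, 30))
names(toy) <- c("Cust(Child)-Entity", "OldSales")
toy_df <- as.disk.frame(toy)

toy_df %>%
  group_by(`Cust(Child)-Entity`) %>%
  summarise(Historical_Sales = sum(OldSales, na.rm = TRUE)) %>%
  collect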
Please advise.