5

I want to iterate through data.table, just as purrr::map does. While I was able to apply data.table functions by converting data.frame to data.table inside purrr::map, I want to know whether data.table has something built-in that would abnegate using purrr::map. I am asking this because I am unsure of purrr::map's performance in terms of both speed and memory required. I have been disappointed with dplyr's speed and memory utilization, as compared to data.table when dealing with large datasets.

I researched stackoverflow and found that accepted answer on Iterate through data tables thread has used for loop. I am not a big fan of for loop for performance reasons.

Here's sample data file:

dput(Input_File)
structure(list(Zone = c("East", "East", "East", "East", "East", 
"East", "East", "West", "West", "West", "West", "West", "West", 
"West"), Fiscal.Year = c(2016, 2016, 2016, 2016, 2016, 2016, 
2017, 2016, 2016, 2016, 2017, 2017, 2018, 2018), Transaction.ID = c(132, 
133, 134, 135, 136, 137, 171, 171, 172, 173, 175, 176, 177, 178
), L.Rev = c(3, 0, 0, 1, 0, 0, 2, 1, 1, 2, 2, 1, 2, 1), L.Qty = c(3, 
0, 0, 1, 0, 0, 1, 1, 1, 2, 2, 1, 2, 1), A.Rev = c(0, 0, 0, 1, 
1, 1, 0, 0, 0, 0, 0, 1, 0, 0), A.Qty = c(0, 0, 0, 2, 2, 3, 0, 
0, 0, 0, 0, 3, 0, 0), I.Rev = c(4, 4, 4, 0, 1, 0, 3, 0, 0, 0, 
1, 0, 1, 1), I.Qty = c(2, 2, 2, 0, 1, 0, 3, 0, 0, 0, 1, 0, 1, 
1)), .Names = c("Zone", "Fiscal.Year", "Transaction.ID", "L.Rev", 
"L.Qty", "A.Rev", "A.Qty", "I.Rev", "I.Qty"), row.names = c(NA, 
14L), class = "data.frame")

Here's sample code with purrr::map and data.table

UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)
a<-purrr::map(UZone, ~ dplyr::filter(Input_File, Zone == .)) %>%
   purrr::map(~ data.table::as.data.table(.)) %>%
   purrr::map(~ .[,.(sum = sum(L.Rev)),by=Fiscal.Year])

I am not too concerned about the output, but I want to know what alternatives are available to iterate through data.table based on a specific column. I'd appreciate any thoughts.

watchtower
  • 4,140
  • 14
  • 50
  • 92
  • 2
    Maybe I'm oversimplifying, but isn't this just: `b <- Input_File[, .(sum=sum(L.Rev)), by=.(Zone,Fiscal.Year)]` ? You could `split` to separate parts if you really want - `split(b[,-"Zone"], b$Zone)` – thelatemail Dec 21 '17 at 04:36
  • @Thelatemail - That did help. Sometimes simpler solutions are better than complex ones. If you could post an answer, I can accept it. Thanks so much for your help. I think I went down the rabbit hole. Otherwise, I will keep this question open, just in case we get any other solution. – watchtower Dec 21 '17 at 06:01

2 Answers2

2

Piping data tables can be very well done by repeating [], e.g. DT[][][]. For lists, I think there's no alternatives to magrittr. The rest can be completed by chaining lapply

library(data.table)
library(magrittr)

Input_File <- data.table(Input_File)

UZone <- unique(Input_File$Zone)
FYear <- unique(Input_File$Fiscal.Year)

lapply(UZone, function(x) Input_File[Zone==x]) %>% 
  lapply(function(x) x[,.(sum=sum(L.Rev)), by=Fiscal.Year])

In case you want to iterate over columns, you may want to have a look at this solution

UPDATE: I guess there could be a cleaner solution without importing magrittr and without $ subsetting

library(data.table)

Input_File <- data.table(Input_File)

by_zone_lst <- lapply(Input_File[,unique(Zone)], function(x) Input_File[Zone==x])
summary_lst <- lapply(by_zone_lst, function(y) y[,.(sum=sum(L.Rev)), by=Fiscal.Year])

summary_lst
dmi3kno
  • 2,943
  • 17
  • 31
  • thanks. Is `%>%` more efficient? I'm curious. I have seen in one of the threads that `lapply` is more efficient than `purrr::map` – watchtower Dec 21 '17 at 16:10
  • 1
    It is just a little more readable. You can cut it out and just use lapply(lapply()) – dmi3kno Dec 21 '17 at 16:17
  • Thanks for your help. Do you think you could post how `lapply(lappy())` would look like? It's going to help novice like me and others who are reading the thread. – watchtower Dec 21 '17 at 16:20
  • 1
    Edited the answer to remove `magrittr` pipes. Nested `lapply()` is absolutely unreadable. – dmi3kno Dec 21 '17 at 17:23
1

I'm not sure what's behind the question but I do prefer

library(data.table)
setDT(Input_File)[, .(sum = sum(L.Rev)), by = .(Zone, Fiscal.Year)]
   Zone Fiscal.Year sum
1: East        2016   4
2: East        2017   2
3: West        2016   4
4: West        2017   3
5: West        2018   3

over OP's approach returning a as

[[1]]
   Fiscal.Year sum
1:        2016   4
2:        2017   2

[[2]]
   Fiscal.Year sum
1:        2016   4
2:        2017   3
3:        2018   3
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • thanks for your help. The only reason I wanted to display list is that I could spit the output in different Excel tabs using `write.xlsx` function. – watchtower Dec 24 '17 at 19:58
  • You can chain , say, `split(dt_df, by = "Fiscal.Year")` right after `data.table` summary to make a list. `setDT(Input_File)[, .(sum = sum(L.Rev)), by = .(Zone, Fiscal.Year)] %>% split(., by = 'Fiscal.Year')` – Fred Jul 25 '20 at 15:52