-1

I have encountered a problem, that i am sure has a easy solution, but i cannot find it. I basically summarise my table to get the sum of an value per level of a factor variable:

 NOdependants <- unique(claimsMonthly[policyID == policy, .(exposure = sum(exposure)),
                                        by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])

   productID       Year product QualityCheck dependant exposure
1:           1      2016  ELI18            0  EMPLOYEE 17.041096
2:           1      2016  ELI18            0    SPOUSE 40.484932
3:           1      2016  ELI18            0     CHILD 5.164384

and then i do the following:

NOdependants <- dcast(NOdependants,  productID + Year ~ dependant, value.var = "exposure", fill = 0, drop = FALSE, fun.aggregate = sum)
setnames(NOdependants, c("CHILD", "EMPLOYEE", "SPOUSE"), c("childno", "employeeno", "spouseno"), skip_absent=TRUE)

> NOdependants
   productRank startYear  childno employeeno spouseno
1:           1      2016 5.164384   17.041096 41.484932

This is all good so far. The problem is when a product doesnt have any data on one of the factor in dependant. Lets say there are no children:

 NOdependants <- unique(claimsMonthly[policyID == policy, .(exposure = sum(exposure)),
                                        by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])

   productID       Year product QualityCheck dependant exposure
1:           1      2016  ELI18            0  EMPLOYEE 17.041096
2:           1      2016  ELI18            0    SPOUSE 40.484932

Then my dcast does the following:

> NOdependants
   productRank startYear  employeeno spouseno
1:           1      2016  17.041096 41.484932

and this is a problem for me, I need to have all three columns. SO what i need is to artifically create an additional columns in case a factor level is without data (like child in here), so i would get something this:

> NOdependants
   productRank startYear  childno employeeno spouseno
1:           1      2016       0   17.041096 41.484932

for now i have create a workround, where i first create an empty data.table and then use rbindlist with fill=0 to merge theses, but there must be some easier solution.

Any ideas?

Note: i am working with a lot of data and this operation is part of a loop which will be repeated around 80 times or so, so ideally something efficient is possible.

SIMPLIFIED EXAMPLE WITH DATA:

#
> claimsMonthly <- data.table(productID = c(rep(1,6), rep(2,3), rep(3,2)),
+                      Year = c(rep(2015,9), 2016, 2016),
+                      product = c(rep("ELI18",6), rep("JCI22",3), rep("ZDP01",2)),
+                      dependant = c(rep(c("EMPLOYEE", "SPOUSE", "CHILD"), 3),"EMPLOYEE", "SPOUSE"),
+                      QualityCheck = c(rep(0,11)),
+                      exposure = c(abs(rnorm(11))))
> 
> productIDs <- unique(claimsMonthly$productID)
> for(prod in productIDs){
+  
+   NOdependants <- unique(claimsMonthly[ productID == prod, .(exposure = sum(exposure)),
+                                         by = c("productID", "Year", "product", "QualityCheck", "dependant")][order(Year)])
+   
+   NOdependants <- dcast(NOdependants,  productID + Year ~ dependant, value.var = "exposure", fill = 0, drop = FALSE, fun.aggregate = sum)
+   setnames(NOdependants, c("CHILD", "EMPLOYEE", "SPOUSE"), c("childno", "employeeno", "spouseno"), skip_absent=TRUE) 
+ 
+   NOdependants[order(childno)]
+     
+ }
Error in .checkTypos(e, names_x) : 
  Object 'childno' not found amongst productID, Year, employeeno, spouseno
Nneka
  • 1,764
  • 2
  • 15
  • 39

1 Answers1

0

Your use of 'unique' outside of data.table brackets may perplex data.table here. Please See: https://www.rdocumentation.org/packages/data.table/versions/1.12.8/topics/duplicated

I am wondering if your code could be simpler and achieve your result just as well. Some of the beauty of rdata.table is its ability to eliminate the need for loops and control structures. Using your sample data for 'claimsMonthly':

claimsMonthly[, .(exposure = sum(exposure)),
.(productID,Year,product,QualityCheck,dependant)][
,dcast(.SD, productID + Year ~ dependant,
value.var = "exposure", drop = FALSE, fun.aggregate = sum)][
         CHILD == 0 &
         EMPLOYEE == 0 &
         SPOUSE == 0,.(productID,Year,CHILD,EMPLOYEE,SPOUSE)]

       productID Year CHILD EMPLOYEE SPOUSE
    1:         1 2016     0        0      0
    2:         2 2016     0        0      0
    3:         3 2015     0        0      0
rferrisx
  • 1,598
  • 2
  • 12
  • 14
  • thank you @rferrisx. I do not fully understand what it means to "perplex data.table" but i will look into it. Had no idea this could be an issue. Also i am a beginner with data.table, trying to learn hoe to use the effectively. So the ultimate goal would be to remove the loops. However, your solution results in 0 over all the cells ? So it doesnt really provide what i need – Nneka Jun 03 '20 at 11:47
  • i see what you mean, i got rid of the loop and it really solves all my troubles ... – Nneka Jun 03 '20 at 12:08
  • 1
    If you use commands inside data.table brackets, a data. table is returned. Additionally, data.table uses 'lapply' to iterate over lists. See https://stackoverflow.com/questions/32276887/use-of-lapply-sd-in-data-table-r. Also, the vignettes for rdata.table are well done. The github page forcrdata.table contains lots of tutorials. In my use of data.table, I compare and contrast what and how I would use rdata.table vs SQL query syntax. That helps ground my usage patterns. – rferrisx Jun 04 '20 at 20:38