Pipe output of more than one variable using tidyr::map or dplyr

Question

I am building upon my earlier question on SO at Pipe output of one data.frame to another using dplyr

I want to create six correlation matrices that would let me analyze evolution of correlation in $ spent and quantity sold in the last three years. In essence, I am looking for 2 X [3X3] type list. So far, I am able to create 3X3 list using tidyr::map() by making separate calls for each Product_Type and Quantity, but I have been unsuccessful in doing in one vectorized call. As you will see below, there is a lot of redundancy in my code.

Here's my data:

dput(DFile_Gather)
structure(list(Order.ID = c(456, 567, 345, 567, 2345, 8910, 8910, 
789, 678, 456, 345, 8910, 234, 1234, 456), Calendar.Year = c(2015, 
2015, 2016, 2015, 2017, 2015, 2015, 2016, 2015, 2015, 2016, 2015, 
2016, 2016, 2015), Product_Type = c("Insurance", "Insurance", 
"Tire", "Tire", "Rental", "Insurance", "Servicing", "Truck", 
"Tire", "Servicing", "Truck", "Rental", "Car", "Servicing", "Tire"
), Mexican_Pesos = c(35797.32, 1916.25, 19898.62, 0, 22548.314011, 
686.88, 0, 0, 0, 0, 0, 0, 0, 0, 203276.65683), Quantity = c(0.845580721440663, 
0.246177053792905, 2.10266268677851, 1.89588258358317, 0.00223077008050406, 
0.454640961140588, 1.92032156606277, 0.475872861771994, 0.587966920885798, 
0.721024745664671, 0.696609684682582, 0.0441522564791413, 0.872232778060772, 
0.343347997825813, 0.716224049425646)), .Names = c("Order.ID", 
"Calendar.Year", "Product_Type", "Mexican_Pesos", "Quantity"), row.names = c(54L, 
55L, 13L, 15L, 50L, 58L, 28L, 37L, 16L, 24L, 33L, 48L, 2L, 29L, 
14L), class = "data.frame")

Here's my code for the first iteration : i.e. calculate correlation matrix for Product_Type

DFile_Spread_PType <- spread(DFile_Gather[-length(DFile_Gather)],key = Product_Type, value = Mexican_Pesos)

DFile<-DFile_Spread_PType
CYear <- unique(DFile$Calendar.Year)
DFile_Corr_PType <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>% 
  purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
  structure(., names = CYear)

Finally, here's my code for the second iteration for Correlation Matrix by Quantity:

DFile_Spread_Qty <- spread(subset( DFile_Gather, select = -Mexican_Pesos),key = Product_Type, value = Quantity)
DFile<-DFile_Spread_Qty
DFile_Corr_Qty <- purrr::map(CYear, ~ dplyr::filter(DFile, Calendar.Year == .)) %>% 
  purrr::map(~ cor(.[,colnames(DFile)[3:length(colnames(DFile))]]) ) %>%
  structure(., names = CYear)

As you can see above, there is too much redundancy, and the code looks really clunky. I would sincerely appreciate if someone could help me out. I am specifically looking for two things:

1) do what I am doing above by not having any redundancy

2) If possible, get a list of 2X3X3 i.e. Quantity and Product_Type at top level, and then 3x3 correlation matrices referenced to each of the above.

I searched similar topics on SO, but I don't think there was any thread on similar topics.

Thanks in advance.

score 1 · Accepted Answer · answered Nov 06 '16 at 23:08

1

The following has no redundancy and uses no packages. Make Product_Type a factor and then split by year giving the list of years s. Now use a double Map over s and Values converting to wide form on each inner iteration using tapply and running cor.

DG <- transform(DFile_Gather, Product_Type = factor(Product_Type))
s <- split(DG, DG$Calendar.Year)
Values <- c("Mexican_Pesos", "Quantity")
By <- c("Order.ID", "Product_Type")
res <- Map(function(v) Map(function(s) cor(tapply(s[, v], s[By], c)), s), Values)

answered Nov 06 '16 at 23:08

G. Grothendieck

254,981
17
203
341

@G Grothendieck - Thanks for your help. This does solve the problem. Very respectfully, do you think you could demystify the last line a little bit? I tried understanding what you have done by executing the statement in parts but I am getting all sort ungodly errors that I don't think I will ever be able to understand. I'd appreciate if you could explain the last statement by breaking it down into executable pieces that could be run on our machine because it's really powerful, and if understood properly, it will help me a lot in the future. – watchtower Nov 06 '16 at 23:48
1

`tapply(s[, v], v[By], c)` converts `Mexican_Pesos` or `Quantity` (depending on the value of `v`) for one year of data (depending on `s`) to wide form such that the wide form rows are defined by `By[1]`, i.e. by `Order.ID` and the columns by `By[2]`, i.e. by `Product_Type`. We take the `cor` of that. The `cor(tapply(...))` can be thought of as the body of a double loop that iterates over all combinations of `s` and `Values`. – G. Grothendieck Nov 07 '16 at 00:17

score 1 · Answer 2 · answered Nov 07 '16 at 17:50

To get the correlations among Product_Type for each response variable and year combination, you could reshape your dataset into a convenient format, split the dataset into a list for the combination of factors, and get the correlations via map with the help of dplyr::select for selecting the columns. However, this does not return a list of lists.

library(purrr)
library(tidyr)

DFile_Gather %>%
    gather(type, value, Mexican_Pesos:Quantity) %>%
    spread(Product_Type, value) %>%
    split(list(.$Calendar.Year, .$type)) %>%
    map(~cor(dplyr::select(.x, Car:Truck)))

The list of lists took an extra step, as I had to first split by the response variable and then, within each element of that list, split by Calendar.Year. Then I used at_depth instead of map to calculation the correlations across Product_Type for each list within the list. Working at the lowest level is indicated by the 2 in at_depth.

DFile_Gather %>%
    gather(type, value, Mexican_Pesos:Quantity) %>%
    spread(Product_Type, value) %>%
    split(.$type) %>%
    map(~split(.x, .x$Calendar.Year)) %>%
    at_depth(2, ~cor(dplyr::select(.x, Car:Truck)))

The first few rows/columns of the interim dataset after gathering and spreading looks like:

   Order.ID Calendar.Year          type       Car    Insurance       Rental
1       234          2016 Mexican_Pesos 0.0000000           NA           NA
2       234          2016      Quantity 0.8722328           NA           NA
3       345          2016 Mexican_Pesos        NA           NA           NA
4       345          2016      Quantity        NA           NA           NA
5       456          2015 Mexican_Pesos        NA 3.579732e+04           NA
6       456          2015      Quantity        NA 8.455807e-01           NA
...

Pipe output of more than one variable using tidyr::map or dplyr

2 Answers2