4

I have two data.frames--one look-up table that tells me a set products included in a group. Each group has at least one product of Type 1 and Type 2.

The second data.frame tells me details about the transaction. Each transaction can have one of the following products:

a) Only products of Type 1 from one of the groups

b) Only products of Type 2 from one of the groups

c) Product of Type 1 and Type 2 from the same group

For my analysis, I am interested in finding out c) above i.e. how many transactions have products of Type 1 and Type 2 (from the same group) sold. We will ignore the transaction altogether if Product of Type 1 and that of Type 2 from different groups that are sold in the same transaction.

Thus, each product of Type 1 or Type 2 MUST belong to the same group.

Here's my look up table:

> P_Lookup
   Group ProductID1 ProductID2
  Group1          A          1
  Group1          B          2
  Group1          B          3
  Group2          C          4
  Group2          C          5
  Group2          C          6
  Group3          D          7
  Group3          C          8
  Group3          C          9
  Group4          E         10
  Group4          F         11
  Group4          G         12
  Group5          H         13
  Group5          H         14
  Group5          H         15 

For instance, I won't have Product G and Product 15 in one transaction because they belong to different group.

Here are the transactions:

  TransactionID ProductID ProductType
             a1         A           1
             a1         B           1
             a1         1           2
             a2         C           1
             a2         4           2
             a2         5           2
             a3         D           1
             a3         C           1
             a3         7           2
             a3         8           2
             a4         H           1
             a5         1           2
             a5         2           2
             a5         3           2
             a5         3           2
             a5         1           2
             a6         H           1
             a6        15           2

My Code:

Now, I was able to write code using dplyr for shortlisting transactions from one group. However, I am not sure how I can vectorize my code for all groups.

Here's my code:

P_Groups<-unique(P_Lookup$Group)
Chosen_Group<-P_Groups[5]

P_Group_Ind <- P_Trans %>%
group_by(TransactionID)%>%
dplyr::filter((ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID1)) | 
(ProductID %in% unique(P_Lookup[P_Lookup$Group==Chosen_Group,]$ProductID2)) ) %>%
mutate(No_of_PIDs = n_distinct(ProductType)) %>%
mutate(Group_Name = Chosen_Group)

P_Group_Ind<-P_Group_Ind[P_Group_Ind$No_of_PIDs>1,]

This works well as long as I manually select each group i.e. by setting Chosen_Group. However, I am not sure how I can automate this. One way, I am thinking is to use for loop, but I know that the beauty of R is vectorization, so I want to stay away from using for loop.

I'd sincerely appreciate any help. I have spent almost two days on this. I looked at using dplyr in for loop in r, but it seems this thread is talking about a different issue.


DATA: Here's dput for P_Trans:

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6"), ProductID = c("A", "B", "1", "C", "4", "5", "D", 
"C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15"), ProductType = c(1, 
1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 2, 2, 2, 1, 2)), .Names = c("TransactionID", 
"ProductID", "ProductType"), row.names = c(NA, 18L), class = "data.frame")

Here's dput for P_Lookup:

structure(list(Group = c("Group1", "Group1", "Group1", "Group2", 
"Group2", "Group2", "Group3", "Group3", "Group3", "Group4", "Group4", 
"Group4", "Group5", "Group5", "Group5"), ProductID1 = c("A", 
"B", "B", "C", "C", "C", "D", "C", "C", "E", "F", "G", "H", "H", 
"H"), ProductID2 = c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 
14, 15)), .Names = c("Group", "ProductID1", "ProductID2"), row.names = c(NA, 
15L), class = "data.frame")

Here's the dput() after adding a product to P_Trans that doesn't exist in the look-up table:

structure(list(TransactionID = c("a1", "a1", "a1", "a2", "a2", 
"a2", "a3", "a3", "a3", "a3", "a4", "a5", "a5", "a5", "a5", "a5", 
"a6", "a6", "a7"), ProductID = c("A", "B", "1", "C", "4", "5", 
"D", "C", "7", "8", "H", "1", "2", "3", "3", "1", "H", "15", 
"22"), ProductType = c(1, 1, 2, 1, 2, 2, 1, 1, 2, 2, 1, 2, 2, 
2, 2, 2, 1, 2, 3)), .Names = c("TransactionID", "ProductID", 
"ProductType"), row.names = c(NA, 19L), class = "data.frame")
dww
  • 30,425
  • 5
  • 68
  • 111
watchtower
  • 4,140
  • 14
  • 50
  • 92

2 Answers2

4

Below is a tidyverse (dplyr, tidyr, and purrr) solution that I hope will help.

Note that the use of map_df in the last line returns all results as a data frame. If you'd prefer it to be a list object for each group, then simply use map.

library(dplyr)
library(tidyr)
library(purrr)

# Save unique groups for later use
P_Groups <- unique(P_Lookup$Group)

# Convert lookup table to product IDs and Groups
P_Lookup <- P_Lookup %>% 
              gather(ProductIDn, ProductID, ProductID1, ProductID2) %>% 
              select(ProductID, Group) %>% 
              distinct() %>% 
              nest(-ProductID, .key = Group)

# Bind Group information to transactions
# and group for next analysis
P_Trans <- P_Trans %>%
             left_join(P_Lookup) %>%
             filter(!map_lgl(Group, is.null)) %>%  
             unnest(Group) %>% 
             group_by(TransactionID)

# Iterate through Groups to produce results
map(P_Groups, ~ filter(P_Trans, Group == .)) %>% 
  map(~ mutate(., No_of_PIDs = n_distinct(ProductType))) %>% 
  map_df(~ filter(., No_of_PIDs > 1))
#> Source: local data frame [12 x 5]
#> Groups: TransactionID [4]
#> 
#>    TransactionID ProductID ProductType  Group No_of_PIDs
#>            <chr>     <chr>       <dbl>  <chr>      <int>
#> 1             a1         A           1 Group1          2
#> 2             a1         B           1 Group1          2
#> 3             a1         1           2 Group1          2
#> 4             a2         C           1 Group2          2
#> 5             a2         4           2 Group2          2
#> 6             a2         5           2 Group2          2
#> 7             a3         D           1 Group3          2
#> 8             a3         C           1 Group3          2
#> 9             a3         7           2 Group3          2
#> 10            a3         8           2 Group3          2
#> 11            a6         H           1 Group5          2
#> 12            a6        15           2 Group5          2
Simon Jackson
  • 3,134
  • 15
  • 24
  • 2
    This is a fantastic response. I am a beginner, so I have a quick question: could you please explain why you did `mutate(i = 1:n())` and `as_data_frame()`. I tried running the code without these two lines, and it worked fine. So, I am curious. – watchtower Nov 05 '16 at 08:12
  • Good catch @watchtower! Those lines were just there to help me figure things out as I went along. I've removed them from the answer. – Simon Jackson Nov 05 '16 at 08:53
  • Thanks for the clarification. I have a quick question--my original dataset `P_Trans` has some `ProductID` s that are not in the look-up table `P_Lookup`. In such a case, I get this error `Error: Each column must either be a list of vectors or a list of data frames [Group]`. I am a beginner so I can't figure out how to fix this. Do you think you could help me ? As per SO's policy, I don't want to create a new thread. I have added `dput()` in my post. I'd sincerely appreciate your help – watchtower Nov 05 '16 at 09:49
  • 1
    @watchtower thanks for the accept. I've made a change that will remove any products that are not included in the look-up table: `filter(!map_lgl(Group, is.null))`. This takes place in the binding to `P_Trans`. – Simon Jackson Nov 05 '16 at 20:54
3

Here is a single pipe dplyr solution:

P_DualGroupTransactionsCount <- 
    P_Lookup %>% # data needing single column map of Keys
    gather(IDnum, ProductID, ProductID1:ProductID2) %>% # produce long single map of Keys for GroupID (tidyr::)
    right_join(P_trans) %>% # join transactions to groupID info
    group_by(TransactionID, Group) %>% # organize for same transaction & same group
    mutate(DualGroup = ifelse(n_distinct(ProductType)==2, T, F)) %>% # flag groups with both groups in a single transaction
    filter(DualGroup == T) %>% # choose only doubles
    select(TransactionID, Group) %>% # remove excess columns
    distinct %>%  # remove excess rows
    nrow # count of unique transaction ID's

# P_DualGroupTransactions
# Source: local data frame [4 x 2]
# Groups: TransactionID, Group [4]
#     
# TransactionID  Group
#           <chr>  <chr>
# 1            a1 Group1
# 2            a2 Group2
# 3            a3 Group3
# 4            a6 Group5


# P_DualGroupTransactionsCount
 [1] 4
leerssej
  • 14,260
  • 6
  • 48
  • 57