I have a dataframe that I want to convert into transaction data so I can do market basket analysis with the arules package in R. I looked at existing questions on converting a dataframe to transactions, e.g. (How to prep transaction data into basket for arules) and (Transform csv into transactions for arules), but the result I got was different.

dput(df)

structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
Other = c(NA, NA, NA, NA, "Promo", NA)), 
.Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

Below is my dataframe structure

Transaction_ID  Fruits  Vegetables  Personal  Drink  Other
      A001        NA        NA       ToothP   Coff    NA
      A002       Apple      NA       ToothP    NA     NA
      A003      Orange      NA         NA     Coff    NA
      A004        NA      Potato     ToothB   Milk    NA
      A005       Pear       NA       ToothB   Milk   Promo
      A006      Grape      Yam         NA     Coff    NA

class for each column

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
"character"    "character"    "character"    "character"    "character"    "character"

Convert dataframe to transaction data

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

Results I got

[1] {NA,NA,ToothP,Coff,NA}
[2] {Apple,NA,ToothP,NA,NA}
[3] {Orange,NA,NA,Coff,NA}
[4] {NA,Potato,ToothB,Milk,NA}
[5] {Pear,NA,ToothB,Milk,Promo}
[6] {Grape,Yam,NA,Coff,NA}

The dataframe was converted to transaction data successfully, but is there a way to remove the NA items? If they stay in the transaction list, NA will be treated as an item.

amonk
yc.koong

2 Answers

Ogustari is right. Here is the complete code that also handles the transaction IDs.

library("arules")
library("dplyr")  ### for tbl_df
df <- structure(list(Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"), 
  Fruits = c(NA, "Apple", "Orange", NA, "Pear", "Grape"), 
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"), 
  Personal = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA), 
  Drink = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"), 
  Other = c(NA, NA, NA, NA, "Promo", NA)), 
  .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
  class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L))

### remove transaction IDs
tid <- as.character(df[["Transaction_ID"]])
df <- df[,-1]

### make all columns factors
for(i in 1:ncol(df)) df[[i]] <- as.factor(df[[i]])

trans <- as(df, "transactions")

### set transactionIDs
transactionInfo(trans)[["transactionID"]] <- tid

inspect(trans)

    items                                                transactionID
[1] {Personal=ToothP,Drink=Coff}                         A001
[2] {Fruits=Apple,Personal=ToothP}                       A002
[3] {Fruits=Orange,Drink=Coff}                           A003
[4] {Vegetables=Potato,Personal=ToothB,Drink=Milk}       A004
[5] {Fruits=Pear,Personal=ToothB,Drink=Milk,Other=Promo} A005
[6] {Fruits=Grape,Vegetables=Yam,Drink=Coff}             A006
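
If the "Column=" prefix on the item labels is not wanted, one possible alternative (not part of the code above) is to build a named list of item vectors with the NAs dropped and coerce that list instead. The sketch below rebuilds the question's data as a plain data.frame; `item_cols` and `baskets` are names made up for this illustration, and the arules step is skipped when the package is not installed.

```r
# Same data as in the question, rebuilt as a plain data.frame.
df <- data.frame(
  Transaction_ID = c("A001", "A002", "A003", "A004", "A005", "A006"),
  Fruits     = c(NA, "Apple", "Orange", NA, "Pear", "Grape"),
  Vegetables = c(NA, NA, NA, "Potato", NA, "Yam"),
  Personal   = c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA),
  Drink      = c("Coff", NA, "Coff", "Milk", "Milk", "Coff"),
  Other      = c(NA, NA, NA, NA, "Promo", NA),
  stringsAsFactors = FALSE
)

# Every column except the ID holds items (illustrative helper name).
item_cols <- setdiff(names(df), "Transaction_ID")

# One character vector per row, with the NA entries dropped.
baskets <- lapply(seq_len(nrow(df)), function(i) {
  row_items <- unlist(df[i, item_cols], use.names = FALSE)
  row_items[!is.na(row_items)]
})

# The list names carry the transaction IDs.
names(baskets) <- df$Transaction_ID

# Coercing the list needs arules; skipped if the package is not installed.
if (requireNamespace("arules", quietly = TRUE)) {
  library(arules)
  trans <- as(baskets, "transactions")
  inspect(trans)
}
```

With this layout the items are plain labels such as ToothP or Coff rather than Personal=ToothP, so the column an item came from is lost. Whether that matters depends on the analysis.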
Michael Hahsler
  • Hi Michael, thanks for sharing! It works in my case, but I have one issue with the (### remove transaction IDs) part: my transactionID was missing after I removed the transaction IDs and set them back into trans. Maybe the temp file was missing? – yc.koong Aug 20 '17 at 03:08
  • tid <- as.character(df[["Transaction_ID"]]) returns NULL for tid – yc.koong Aug 20 '17 at 03:47
  • @yc.koong I have added two library statements so the example is self-contained and loads arules and dplyr (needed since your data is a tbl_df). It now runs for me in a fresh session without problems. – Michael Hahsler Aug 21 '17 at 14:01

I can propose this solution, but I do not know if it is the one you are looking for.

dput(df)

df <- data.frame(structure(list(Transaction_ID = as.factor(c("A001", "A002", "A003", "A004", "A005", "A006")), 
               Fruits = as.factor(c(NA, "Apple", "Orange", NA, "Pear", "Grape")), 
               Vegetables = as.factor(c(NA, NA, NA, "Potato", NA, "Yam")), 
               Personal = as.factor(c("ToothP", "ToothP", NA, "ToothB", "ToothB", NA)), 
               Drink = as.factor(c("Coff", NA, "Coff", "Milk", "Milk", "Coff")), 
               Other = as.factor(c(NA, NA, NA, NA, "Promo", NA))), 
          .Names = c("Transaction_ID", "Fruits", "Vegetables", "Personal", "Drink", "Other"), 
          class = c("tbl_df", "tbl", "data.frame"), row.names = c(NA, -6L)))

Class of each column. Note that the classes are all "factor".

sapply(df, class)
Transaction_ID         Fruits     Vegetables       Personal          Drink          Other 
      "factor"       "factor"       "factor"       "factor"       "factor"       "factor"

Convert data frame to transaction data

data <- as(df, "transactions")
inspect(data)

The result I've got

     items                 transactionID
[1] {Transaction_ID=A001,              
     Personal=ToothP,                  
     Drink=Coff}                      1
[2] {Transaction_ID=A002,              
     Fruits=Apple,                     
     Personal=ToothP}                 2
[3] {Transaction_ID=A003,              
     Fruits=Orange,                    
     Drink=Coff}                      3
[4] {Transaction_ID=A004,              
     Vegetables=Potato,                
     Personal=ToothB,                  
     Drink=Milk}                      4
[5] {Transaction_ID=A005,              
     Fruits=Pear,                      
     Personal=ToothB,                  
     Drink=Milk,                       
     Other=Promo}                     5
[6] {Transaction_ID=A006,              
     Fruits=Grape,                     
     Vegetables=Yam,                   
     Drink=Coff}                      6

I found part of the solution here: convert data frame in r to transaction or an itemMatrix. Moreover, it seems that your command

data <- as(split(df[,"Fruits"], df[,"Vegetables"],df[,"Personal"], df[,"Drink"], df[,"Other"]), "transactions")
inspect(data)

only works for a data.frame containing only two columns.
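
To illustrate the two-column case: base R's split(x, f) takes one vector of values and one grouping factor, and its third positional argument is drop rather than another column. A minimal sketch, assuming a hypothetical long-format table `long` (not from the question) with one row per transaction/item pair:

```r
# Hypothetical long-format data: one row per (transaction, item) pair.
long <- data.frame(
  Transaction_ID = c("A001", "A001", "A002", "A002", "A003", "A003"),
  Item           = c("ToothP", "Coff", "Apple", "ToothP", "Orange", "Coff"),
  stringsAsFactors = FALSE
)

# split(x, f): x holds the items, f is the grouping column.
# The result is a list with one character vector per transaction ID.
baskets <- split(long$Item, long$Transaction_ID)
```

A list in this shape can then be passed to as(baskets, "transactions") once arules is loaded.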

Ogustari