I would like to use the arulesSequences package in R. However, I don't know how to coerce my data frame into an object that the package can work with.

Here is a toy dataset that replicates my data structure:

ids <- c(rep("X", 5), rep("Y", 5), rep("Z", 5))
seq <- rep(1:5,3)
val <- sample(LETTERS, 15, replace = TRUE)
df <- data.frame(ids, seq, val)
df

   ids seq val
1    X   1   T
2    X   2   H
3    X   3   V
4    X   4   A
5    X   5   X
6    Y   1   D
7    Y   2   B
8    Y   3   A
9    Y   4   D
10   Y   5   P
11   Z   1   Q
12   Z   2   R
13   Z   3   W
14   Z   4   W
15   Z   5   P

Any help will be greatly appreciated.

rcs
Btibert3
  • To be clear: this data frame represents three sequences? `X="THVAX"; Y="DBADP"; Z="QRWWP"`? (Why is it stored that way?) – David Robinson Oct 23 '12 at 00:52
  • If I wanted to just use the arules package, I would only keep the ids and val column. Each of the 3 transactions (X/Y/Z) would have 5 items. Because I want to do sequence mining (factor in the order of each item), I need to have a sequence/timing variable. I am struggling with how to generate transactions that retain this "timing" component. – Btibert3 Oct 23 '12 at 15:52
  • Hi, Did you find an answer to this problem? – Sir1 Feb 24 '18 at 12:00

6 Answers

1

Factor data frame:

df_fact = data.frame(lapply(df,as.factor))

Build "transaction" data:

df_trans = as(df_fact, 'transactions')

Test it:

itemFrequencyPlot(df_trans, support = 0.1, cex.names=0.8)
chris
  • Thanks for your help, but I believe the key is that the transactions data frame needs to retain temporal information, per the data argument in the cspade function from the package arulesSequences. Here is the code and the resulting error when I try to use cspade: `tmp <- cspade(df_trans)` gives `Error in cspade(df_trans) : slot transactionInfo: missing 'sequenceID' or 'eventID'` – Btibert3 Oct 23 '12 at 12:39
  • @Btibert3 did you find out how to do this? – eflores89 Feb 23 '15 at 22:28
  • @eflores89 No, I haven't. Quite frankly, knowing what I know now, I might move to modeling this in Neo4j – Btibert3 Mar 12 '15 at 17:07
1

By using read_baskets:

    # "filePath.txt" is a placeholder for the path to your exported text file
    read_baskets(con  = "filePath.txt",
      sep = " ",
      info = c("sequenceID","eventID","SIZE"))

In practice this means exporting the data to a text file and re-importing it through read_baskets. The info argument names the leading columns of each line: the sequenceID, the eventID and an optional event size column.
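
For the toy data in the question, the export step might look roughly like the sketch below. It only uses base R and stops once the file is written. The fixed `val` values, the temporary file and the conversion of the IDs to integers are my own assumptions for illustration, not something the package documentation prescribes.

```r
# Toy data with the same shape as in the question (fixed values instead of
# sample() so the result is reproducible).
df <- data.frame(
  ids = rep(c("X", "Y", "Z"), each = 5),
  seq = rep(1:5, times = 3),
  val = c("T", "H", "V", "A", "X",
          "D", "B", "A", "D", "P",
          "Q", "R", "W", "W", "P"),
  stringsAsFactors = FALSE
)

# One line per event: sequenceID, eventID, SIZE, then the item(s).
# Every event here holds exactly one item, so SIZE is always 1.
baskets <- data.frame(
  sequenceID = as.integer(factor(df$ids)),  # assumption: plain integer IDs
  eventID    = df$seq,
  SIZE       = 1L,
  items      = df$val
)

# Keep the rows ordered by sequence, then by event.
baskets <- baskets[order(baskets$sequenceID, baskets$eventID), ]

basket_file <- tempfile(fileext = ".txt")  # hypothetical output path
write.table(baskets, file = basket_file, sep = " ",
            quote = FALSE, row.names = FALSE, col.names = FALSE)

# Inspect the first lines of the exported file.
basket_lines <- readLines(basket_file)
head(basket_lines, 3)
```

The resulting file has no header and no quoting, so `basket_file` can then be passed as `con` to the read_baskets call above.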

1

What worked for me was adding what is essentially an "order" column that lists an order ranking rather than a time value. You just have to be very specific in the naming convention. Try naming the "group" or "ordered basket #" variable sequenceID, and call the ranking or ordering eventID.

Another thing that helped me (and had me scratching my head for a long time) was that read_baskets() seemed to need me to specify

read_baskets(con  = "filePath.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))

Even though the help function makes the c() details seem like an optional header, it is not. I seemed to need to remove the header from my file and specify it in the read_baskets() command, or I'd run into problems.
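
If your file already has a header line, stripping it could look something like this. It is a base R sketch only; the file contents and the names `with_header`, `no_header` and `info_cols` are made up for the example.

```r
# Hypothetical input: a space-separated file whose first line is a header.
with_header <- tempfile(fileext = ".txt")
writeLines(c("sequenceID eventID SIZE item",
             "1 1 1 T",
             "1 2 1 H",
             "2 1 1 D"),
           with_header)

raw_lines <- readLines(with_header)

# Write everything except the header line to a new file.
no_header <- tempfile(fileext = ".txt")
writeLines(raw_lines[-1], no_header)

# Keep the first three header fields so they can be passed as `info`.
header_fields <- strsplit(raw_lines[1], " ", fixed = TRUE)[[1]]
info_cols <- header_fields[1:3]
info_cols
```

`no_header` is then the file to hand to read_baskets(), with `info = info_cols`.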

ednaMode
0

Instead of using the data frame, what worked best for me was to split the data by individual and then convert it to transactions.

 eh$cost<-split(eh$cost$val ,eh$cost$id)
 eh$cost1<- as(eh$cost,"transactions")
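
To show what the split step produces on the question's toy data, here is a small base R sketch. The fixed `val` values and the name `by_id` are assumptions for illustration. Note that a transactions object stores each basket as a set of items, so this step on its own does not add the sequenceID/eventID information discussed in the comments above.

```r
# Toy data shaped like the question's data frame (fixed values for reproducibility).
df <- data.frame(
  ids = rep(c("X", "Y", "Z"), each = 5),
  seq = rep(1:5, times = 3),
  val = c("T", "H", "V", "A", "X",
          "D", "B", "A", "D", "P",
          "Q", "R", "W", "W", "P"),
  stringsAsFactors = FALSE
)

# split() returns a named list with one character vector of items per id.
by_id <- split(df$val, df$ids)

names(by_id)    # the ids become the list names
lengths(by_id)  # number of items per id
by_id$Z         # items for id "Z", in their original row order
```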
Akshata T
0

You have to first change your items into transactions, so just coerce the column of items:

trans = as(df[,'val'], "transactions")

Then you can add the information to your transactions object:

trans@itemsetInfo$transactionID = NULL
trans@itemsetInfo$sequenceID = df$ids
trans@itemsetInfo$eventID = df$seq

P_Sta
0
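library(dplyr)
# One row per (sequence, event): count the items and join them with spaces
df <- df %>% arrange(ids, seq) %>% group_by(ids, seq) %>%
  summarise(size = n(), items = paste(val, collapse = " "), .groups = "drop")

Then write it to a text file (this tutorial also suggests writing the data out after wrangling and reading it back with the read_baskets function):

# quote = FALSE keeps the ids and item labels unquoted in the file
write.table(df, file = "trans.txt", sep = " ", row.names = FALSE, col.names = FALSE, quote = FALSE)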

Read the file and check it:

x <- read_baskets("trans.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))
as(x, "data.frame")
lgadar