Arules Sequence Mining in R

Question

I am looking to use the arulesSequences package in R. However, I have no idea as to how to coerce my data frame into an object that can leverage this package.

Here is a toy dataset that replicates my data structure:

ids <- c(rep("X", 5), rep("Y", 5), rep("Z", 5))
seq <- rep(1:5,3)
val <- sample(LETTERS, 15, replace=T)
df <- data.frame(ids, seq, val)
df

   ids seq val
1    X   1   T
2    X   2   H
3    X   3   V
4    X   4   A
5    X   5   X
6    Y   1   D
7    Y   2   B
8    Y   3   A
9    Y   4   D
10   Y   5   P
11   Z   1   Q
12   Z   2   R
13   Z   3   W
14   Z   4   W
15   Z   5   P

Any help will be greatly appreciated.

To be clear: does this data frame represent three sequences, `X="THVAX"; Y="DBADP"; Z="QRWWP"`? (Why is it stored that way?) – David Robinson, Oct 23 '12 at 00:52
If I wanted to just use the arules package, I would only keep the ids and val column. Each of the 3 transactions (X/Y/Z) would have 5 items. Because I want to do sequence mining (factor in the order of each item), I need to have a sequence/timing variable. I am struggling with how to generate transactions that retain this "timing" component. — Btibert3, Oct 23 '12 at 15:52

score 1 · Answer 1 · answered Oct 23 '12 at 02:03

Convert the data frame's columns to factors:

df_fact = data.frame(lapply(df,as.factor))

Build "transaction" data:

df_trans = as(df_fact, 'transactions')

Test it:

itemFrequencyPlot(df_trans, support = 0.1, cex.names=0.8)

chris


Thanks for your help, but I believe the key is that the transactions data frame needs to retain temporal information, per the data argument in the cspade function from the package arulesSequences. Here is the code and the resulting error when I try to use cspade: `tmp <- cspade(df_trans)` gives `Error in cspade(df_trans) : slot transactionInfo: missing 'sequenceID' or 'eventID'` – Btibert3 Oct 23 '12 at 12:39
@Btibert3 did you find out how to do this? – eflores89 Feb 23 '15 at 22:28
@eflores89 No, I haven't. Quite frankly, knowing what I know now, I might move to modeling this in Neo4j – Btibert3 Mar 12 '15 at 17:07

score 1 · Answer 2 · answered Mar 18 '15 at 09:25

By using read_baskets:

    read_baskets(con  = filePath.txt,
      sep = " ",
      info = c("sequenceID","eventID","SIZE"))

In practice, this means exporting the data to a text file and re-importing it through read_baskets. The info argument names the first columns of the file: the sequenceID, the eventID and an optional event-size column (SIZE).
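As a rough illustration of that round trip, the sketch below writes the question's toy data in a "sequenceID eventID SIZE item" layout and reads it back. The fixed item values, the integer recoding of the IDs (X, Y, Z become 1, 2, 3) and the temporary file are assumptions made for the example, not part of the answer above. The recoding is only a cautious choice, since nothing in this thread confirms whether character IDs are accepted downstream.

```r
# Toy data shaped like the question's data frame (fixed values instead of sample()).
df <- data.frame(
  ids = rep(c("X", "Y", "Z"), each = 5),
  seq = rep(1:5, times = 3),
  val = c("T", "H", "V", "A", "X",
          "D", "B", "A", "D", "P",
          "Q", "R", "W", "W", "P"),
  stringsAsFactors = FALSE
)

# Assumption: integer sequence IDs (X -> 1, Y -> 2, Z -> 3) as a cautious choice.
seq_id <- as.integer(factor(df$ids))

# One line per event: sequenceID eventID SIZE item.
# SIZE is 1 because every event in the toy data holds exactly one item.
basket_lines <- paste(seq_id, df$seq, 1L, df$val)

basket_file <- tempfile(fileext = ".txt")  # hypothetical file location
writeLines(basket_lines, basket_file)

# Re-import the file only if arulesSequences is installed.
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  trans <- arulesSequences::read_baskets(
    con  = basket_file,
    sep  = " ",
    info = c("sequenceID", "eventID", "SIZE")
  )
}
```

Inspecting `basket_lines` before importing is a quick way to confirm that the file has no header and that the ID columns come first.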

score 1 · Answer 3 · answered Jan 14 '16 at 12:23

What worked for me was adding an "order" column that holds an order ranking rather than a time value. You just have to be very specific about the naming convention: name the "group" or "ordered basket #" variable sequenceID, and name the ranking or ordering variable eventID.

Another thing that helped me (and had me scratching my head for a long time) was that read_baskets() seemed to need the info argument spelled out explicitly:

read_baskets(con  = filePath.txt, sep = " ", info = c("sequenceID","eventID","SIZE"))

Even though the help page makes the c() details look like an optional header, they were not optional for me. I seemed to need to remove the header from my file and specify the column names in the read_baskets() call, or I'd run into problems.
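The header removal described here can be sketched in base R. The file contents and the names `raw_file` and `clean_file` are hypothetical; the only point is that the first line is dropped and the column names go into `info` instead.

```r
# Hypothetical input file that still carries a header line.
raw_file <- tempfile(fileext = ".txt")
writeLines(c("sequenceID eventID SIZE item",
             "1 1 1 T",
             "1 2 1 H",
             "2 1 1 D"), raw_file)

# Drop the first (header) line and write the remaining lines to a new file.
all_lines  <- readLines(raw_file)
data_lines <- all_lines[-1]
clean_file <- tempfile(fileext = ".txt")
writeLines(data_lines, clean_file)

# The column names are supplied through `info` rather than through the file.
if (requireNamespace("arulesSequences", quietly = TRUE)) {
  trans <- arulesSequences::read_baskets(
    con  = clean_file,
    sep  = " ",
    info = c("sequenceID", "eventID", "SIZE")
  )
}
```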

score 0 · Answer 4 · answered Aug 27 '15 at 17:53

Instead of using the data frame directly, what worked best for me was to split the values by id into individual groups and then convert the result to transactions.

 eh$cost  <- split(eh$cost$val, eh$cost$id)
 eh$cost1 <- as(eh$cost, "transactions")

Akshata T


score 0 · Answer 5 · answered Jun 21 '17 at 13:40

You first have to turn your items into transactions, so just coerce the item column:

trans = as(df[,'val'], "transactions")

Then you can add the sequence information to your transactions object:

trans@itemsetInfo$transactionID = NULL
trans@itemsetInfo$sequenceID = df$ids
trans@itemsetInfo$eventID = df$seq
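A variant of the same idea, offered only as a sketch, sets the information through the `transactionInfo()` accessor from arules instead of writing to slots directly. The fixed item values, the integer recoding of the IDs and the support threshold of 0.5 are assumptions chosen for illustration. The rows are assumed to be sorted by sequence and then by event, as in the question's data.

```r
library(arules)
library(arulesSequences)

# Toy data shaped like the question's data frame (fixed values instead of sample()).
df <- data.frame(
  ids = rep(c("X", "Y", "Z"), each = 5),
  seq = rep(1:5, times = 3),
  val = c("T", "H", "V", "A", "X",
          "D", "B", "A", "D", "P",
          "Q", "R", "W", "W", "P"),
  stringsAsFactors = FALSE
)

# One transaction per row (event); each transaction holds a single item.
trans <- as(as.list(df$val), "transactions")

# Attach sequence and event identifiers through the accessor.
# Assumption: integer IDs (X -> 1, Y -> 2, Z -> 3), rows sorted by sequence, then event.
transactionInfo(trans) <- data.frame(
  sequenceID = as.integer(factor(df$ids)),
  eventID    = df$seq
)

# Mine frequent sequences; the support value is an arbitrary illustration.
freq <- cspade(trans, parameter = list(support = 0.5))
as(freq, "data.frame")
```

With both `sequenceID` and `eventID` present in the transaction info, the "missing 'sequenceID' or 'eventID'" error quoted in the comments should no longer be triggered by that check.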

score 0 · Answer 6 · answered Apr 12 '21 at 12:48

library(dplyr)
df <- df %>% arrange(ids, seq) %>% group_by(ids, seq) %>% summarise(size = n(), items = list(val))

Then write it to a text file (this tutorial also suggests writing the data out after wrangling and then reading it back with the read_baskets function):

df$items <- as.character(df$items)
write.table(df, file = "trans.txt", sep = " ", row.names = FALSE, col.names = FALSE)

Read the file and check it:

x <- read_baskets("trans.txt", sep = " ", info = c("sequenceID","eventID","SIZE"))
as(x, "data.frame")
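When an event contains several items, calling as.character() on a list column produces strings such as `c("A", "B")`, which then need cleaning before the file can be read. One possible way around that, sketched here in base R with a hypothetical `events` data frame, is to build each output line as a plain space-separated string before writing it.

```r
# Hypothetical event log in which a single event can contain several items.
events <- data.frame(
  ids = c(1, 1, 1, 2, 2),
  seq = c(1, 1, 2, 1, 2),
  val = c("A", "B", "C", "A", "D"),
  stringsAsFactors = FALSE
)

# Collapse each (ids, seq) group into "<item count> <items separated by spaces>".
baskets <- aggregate(
  val ~ ids + seq,
  data = events,
  FUN  = function(v) paste(length(v), paste(v, collapse = " "))
)

# Sort by sequence, then by event, so the file is ordered.
baskets <- baskets[order(baskets$ids, baskets$seq), ]

# One plain-text line per event: sequenceID eventID SIZE items...
basket_lines <- paste(baskets$ids, baskets$seq, baskets$val)

out_file <- tempfile(fileext = ".txt")  # hypothetical file location
writeLines(basket_lines, out_file)      # no quotes or list syntax to clean up
```

The resulting file can then be passed to read_baskets with the same `sep` and `info` arguments as above.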

The trans.txt file requires some character cleaning, e.g. with Notepad++ – lgadar, Apr 12 '21 at 12:51
