How can I remove sub-sequences from the output of the cspade algorithm in the arulesSequences package in R? For example, suppose my data (Sample.txt) is as follows:

Column names: sequenceID, eventID, size, item

1   1   1   A
1   2   1   B
1   3   1   C
1   4   1   D
2   1   1   A
2   2   1   B
2   3   1   C
3   1   1   A
3   2   1   B
3   3   1   C
3   4   1   D

Running the following arulesSequences code gives the output shown after it:

library("arulesSequences")

# Sample.txt must not contain the column-name header row; remove it before importing
SymptomArulesSeq <- read_baskets("Sample.txt", sep = "[ \t]+",
                                 info = c("sequenceID", "eventID", "size"))

# Mine all sequences with support >= 0.1
s1 <- cspade(SymptomArulesSeq, parameter = list(support = 0.1),
             control = list(verbose = TRUE), tmpdir = tempdir())
summary(s1)
as(s1, "data.frame")

sequence    support
<{A}>   1
<{B}>   1
<{C}>   1
<{D}>   0.6666667
<{A},{D}>   0.6666667
<{B},{D}>   0.6666667
<{C},{D}>   0.6666667
<{B},{C},{D}>   0.6666667
<{A},{C},{D}>   0.6666667
<{A},{B},{C},{D}>   0.6666667
<{A},{B},{D}>   0.6666667
<{A},{C}>   1
<{B},{C}>   1
<{A},{B},{C}>   1
<{A},{B}>   1

How can I find the full-length sequences without losing the items in between?

From the data, the full-length sequences starting from A are A (1), A->B (1), A->B->C (1) and A->B->C->D (0.67). How can I remove the intermediate sub-sequences so that only these results remain?

The challenge is how to eliminate the sequences that start in the middle, such as B and B->C, and also how to eliminate sequences such as A->B->D (here I'm losing the actual sequence, because item C is discarded).
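
To make the expected result concrete, here is a minimal base R sketch that computes it directly from the raw events instead of filtering the cspade output. It is not a feature of arulesSequences. It assumes one item per event (size = 1, as in the sample), and the names `events`, `prefixes_of` and `prefix_support` are made up for this illustration. Support is taken as the share of sequences that begin with the given prefix.

```r
# Toy data from the question (size column omitted: every event holds one item)
events <- data.frame(
  sequenceID = c(1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  eventID    = c(1, 2, 3, 4, 1, 2, 3, 1, 2, 3, 4),
  item       = c("A", "B", "C", "D", "A", "B", "C", "A", "B", "C", "D"),
  stringsAsFactors = FALSE
)

# Order events within each sequence, then build one item vector per sequence
events <- events[order(events$sequenceID, events$eventID), ]
seqs <- split(events$item, events$sequenceID)

# Hypothetical helper: every prefix of a sequence, formatted like the cspade output
prefixes_of <- function(items) {
  vapply(seq_along(items), function(k) {
    paste0("<", paste0("{", items[seq_len(k)], "}", collapse = ","), ">")
  }, character(1))
}
all_prefixes <- unlist(lapply(seqs, prefixes_of), use.names = FALSE)

# Each prefix occurs at most once per sequence, so its count is the number
# of sequences that start with it
counts <- table(all_prefixes)
prefix_support <- data.frame(
  sequence = names(counts),
  support  = as.numeric(counts) / length(seqs),
  stringsAsFactors = FALSE
)

# Shortest prefix first
prefix_support <- prefix_support[order(nchar(prefix_support$sequence)), ]
print(prefix_support)
```

For the sample data this keeps only the four sequences listed above (A, A->B, A->B->C and A->B->C->D), with no B->C and no A->B->D.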

RajaSekhar
  • So by your definition, the only "full-length" sequence is `A->B->C->D`? So you only want the longest sequence that contains all the elements? Or what is your expected result? – MrFlick Jun 25 '14 at 18:00
  • Basically, I need a continuous sequence that starts from the first purchase. B->C is also a sequence, but it misses the fact that the sequence starts from A. The sequence A->B->D can be eliminated by adding `parameter = list(support = 0.1,mingap=1)` – RajaSekhar Jun 25 '14 at 19:31
  • Where do "purchases" come into play here? You didn't answer my question about expected output. I still have no idea what you want as a result. You've only made me more confused. – MrFlick Jun 25 '14 at 19:47
  • The sequenceID here is the customerID, and A, B, C and D are items. The data says that customerID "1" purchased "A" in the first transaction, "B" in the second transaction, and so on. I need only the sequences that start with the first purchased item – RajaSekhar Jun 26 '14 at 08:48
  • What you are looking for are the maximal or closed itemsets... whether it is one or the other depends exactly on the criteria that you want to use to discard the other sequences. The methods %ain%, is.subset, is.superset can help you. For method ruleInduction there is a control parameter to specify maximally frequent itemsets but for cspade the documentation does not mention it. – Picarus Jun 02 '15 at 02:00
  • Did you solve it? I'm looking for something similar. In your dataset I would like to extract the following rules for the first sequenceID: A B C D AB ABC ABCD BC BCD CD. Is it possible somehow? – Developer Mar 06 '17 at 10:04
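
The list in the last comment is the set of all contiguous runs of the first sequence. As a rough base R sketch (plain enumeration, not an arulesSequences call; `items` and `runs` are names chosen for this illustration), they can be generated like this:

```r
# Items of sequenceID 1 in event order (from the question's data)
items <- c("A", "B", "C", "D")

# Every contiguous run items[i..j], grouped by start position i
n <- length(items)
runs <- unlist(lapply(seq_len(n), function(i) {
  vapply(i:n, function(j) paste(items[i:j], collapse = ""), character(1))
}))
print(runs)
```

This yields the ten runs from the comment, ordered by starting item rather than by length.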
