According to the article, which defines BIDE:
BIDE: Efficient Mining of Frequent Closed Sequences
Theorem 2 (BackScan search space pruning):
Let the prefix sequence be an n-sequence,
Sp=e1e2...en
. If∃i(1≤i≤n)
and there exists an iteme′
which appears in each of the i-th semi-maximum periods of the prefixSp
inSDB
, we can safely stop growing prefixSp
.Proof:
Because item
e′
appears in each of the i-th semi-maximum periods of the prefixSp
inSDB
, we can get a new prefixS′p=e1e2...ei−1e′ei...en
(1<i≤n)
orS′p=e′e1e2...en(i=1)
, and both(Sp ⊂ S′p)
and(supSDB(Sp)=supSDB(S′p))
hold. Any locally frequent iteme′′
w.r.t. prefixSp
is also a locally frequent item w.r.t.S′p
, in the meantime(〈Sp,e′′〉⊂〈S′p,e′′〉)
and(supSDB(〈Sp,e′′〉)=supSDB(〈S′p,e′′〉))
hold. This means there is no hope to mine frequent closed sequences with prefixSp
.
I understand that for example if I have an AB
pattern, and I find an e'
, for example C
, which is in the 2nd maximum period of the pattern, so between A
and B
for every sequence, which contains AB
, then I can stop growing AB
, because I could use backward extension to make ACB
, which will have the same support, but it is longer. So any pattern I would get by extending AB
forward, certainly won't be a closed pattern, because the C
is missing from it. That's why I have to stop growing AB
and wait until PrefixSpan grows A -> AC -> ACB
with forward extension. What I don't understand why the maximum period must be constrained to the semi-maximum period in this context and why testing for the semi-maximum period is okay. The article does not write a real explanation. Any idea? Can you write an example where we lose closed frequent patterns by using the maximum period instead of the semi-maximum period?