Assume we have 3 rules:
[1] {A,B,D} -> {C}
[2] {A,B} -> {C}
[3] Whatever it is
Rule [2] is a subset of rule [1] (because rule [1] contains all the items in rule [2]), so rule [1] should be eliminated (because rule [1] is too specific and its information is already included in rule [2]).
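To make clear what I mean by "subset", here is a minimal base R sketch of the comparison. It does not use arules at all; the list layout and the helper `items()` are hypothetical names I made up for illustration, and this is not necessarily how `is.subset` works internally.

```r
# Hypothetical plain-list representation of rules [1] and [2] (not arules objects)
rule1 <- list(lhs = c("A", "B", "D"), rhs = "C")
rule2 <- list(lhs = c("A", "B"), rhs = "C")

# Helper (assumed for illustration): all items of a rule, lhs and rhs together
items <- function(rule) union(rule$lhs, rule$rhs)

# Rule [2] is a subset of rule [1] if every item of [2] also appears in [1]
rule2_in_rule1 <- all(items(rule2) %in% items(rule1))
rule1_in_rule2 <- all(items(rule1) %in% items(rule2))

print(rule2_in_rule1)
print(rule1_in_rule2)
```

So the relation only holds in one direction: [2] is contained in [1], but not the other way around.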
I searched the internet, and everyone seems to use this code to remove redundant rules:
subset.matrix <- is.subset(rules.sorted, rules.sorted)
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
which(redundant)
rules.pruned <- rules.sorted[!redundant]
I don't understand how this code works.
After line 2 of the code, the subset.matrix will become:
     [,1] [,2] [,3]
[1,]   NA    1    0
[2,]   NA   NA    0
[3,]   NA   NA   NA
The cells in the lower triangle are set to NA and, since rule [2] is a subset of rule [1], the corresponding cell is set to 1. So I have 2 questions:
1. Why do we have to set the lower triangle to NA? If we do, how can we check whether rule [2] is a subset of rule [3]? (That cell has been set to NA.)
2. In our case, rule [1] should be the one that is eliminated, but this code eliminates rule [2] instead of rule [1]. (The first cell in column 2 is 1, so according to line 3 of the code the column sum of column 2 is >= 1, and rule [2] is therefore treated as redundant.)
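To make the second question concrete, here is a small base R sketch that rebuilds the matrix above by hand instead of calling `is.subset`, then runs lines 2 to 5 of the snippet on it. The starting matrix is my assumption (every rule a subset of itself, plus the single off-diagonal 1 that I described), and `rules.sorted` is just a character vector standing in for the real rules object.

```r
# Stand-in for the sorted rules (plain strings, not an arules object)
rules.sorted <- c("[1] {A,B,D} -> {C}", "[2] {A,B} -> {C}", "[3] Whatever it is")

# Assumed subset matrix: TRUE on the diagonal, plus the one cell [1,2]
subset.matrix <- diag(3) == 1
subset.matrix[1, 2] <- TRUE

# Line 2 of the snippet: blank out the lower triangle and the diagonal
subset.matrix[lower.tri(subset.matrix, diag = TRUE)] <- NA

# Line 3: a column with at least one remaining TRUE is flagged as redundant
redundant <- colSums(subset.matrix, na.rm = TRUE) >= 1
print(which(redundant))

# Line 5: keep only the rules that were not flagged
rules.pruned <- rules.sorted[!redundant]
print(rules.pruned)
```

With this matrix, column 2 is the only one flagged, so rule [2] is dropped and rules [1] and [3] are kept, which is the opposite of what I expected.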
Any help would be appreciated!