I want to add a new column to my data.table. This column should contain the min of two other columns of all rows that satisfy a certain condition. An example with a data.table looking like this:
library(data.table)
DT <- data.table(pattern=c("A", "A & B", "A & B & C", "A & C & D"),
value1=c(1, 2, 3, 4),
value2=c(5, 6, 7, 8)
)
pattern value1 value2
1: A 1 5
2: A & B 2 6
3: A & B & C 3 7
4: A & C & D 4 8
For every row x and every row i, where the pattern[x] is a sub-pattern of pattern[i] I want to perform a calculation like:
min((value1[i]-value1[x])/(value1[i]/value2[i]-value1[x]/value2[x]))
Because the patterns have a similar order I am able to retrieve the sub-patterns using a regex and replacing the "&" by a wildcard ".*" and checking if it is not the pattern itself. Hence I am able use a for-loop on each row:
setkey(DT,pattern)
for(i in 1:nrow(DT)) {
DT[i, foo:=DT[grepl(gsub("&",".*",DT[i]$pattern,fixed=TRUE),pattern) & DT[i]$pattern!=pattern,
ifelse(.N==0,
NA,
min((DT[i]$value1-value1)/(DT[i]$value1/DT[i]$value2-value1/value2)))]]
}
Unfortunately the dataset is rather large and the for-loop is horribly slow on this dataset. I hope somebody can help me with some data.table magic I don't know off to solve this issue. In essence my question is rather similar to this one, but the string-formatted patterns are given, hence I cannot use range joins.
Background: The patterns are output from association rule mining, for instance {onion & potatoes => burger}. There are thousands of different items (like A,B,C & D in the example). I try to add a statistical measure to find out how a rule relates to its sub-rules.