
I want to add a new column to my data.table. For each row, this column should contain the minimum of a calculation over two other columns, taken across all rows that satisfy a certain condition. An example with a data.table looking like this:

library(data.table)
DT <- data.table(pattern=c("A", "A & B", "A & B & C", "A & C & D"),
                 value1=c(1, 2, 3, 4),
                 value2=c(5, 6, 7, 8)
)  

     pattern value1 value2
1:         A      1      5
2:     A & B      2      6
3: A & B & C      3      7
4: A & C & D      4      8

For every pair of rows x and i, where pattern[x] is a sub-pattern of pattern[i], I want to perform a calculation like:

min((value1[i]-value1[x])/(value1[i]/value2[i]-value1[x]/value2[x]))
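Using the example rows, one such term works out as follows (here x = "A & B", a sub-pattern of i = "A & B & C"):

```r
# i = "A & B & C": value1 = 3, value2 = 7
# x = "A & B":     value1 = 2, value2 = 6
(3 - 2) / (3/7 - 2/6)
# 10.5
```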

Because the items within the patterns always appear in the same order, I can retrieve the sub-patterns using a regex, replacing the "&" by a wildcard ".*" and checking that the match is not the pattern itself. Hence I am able to use a for-loop on each row:
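For example, the substitution and the resulting matches against the table above look like this:

```r
# "A & B" becomes the regex "A .* B", which matches itself and any super-pattern
rx <- gsub("&", ".*", "A & B", fixed = TRUE)
grepl(rx, c("A", "A & B", "A & B & C", "A & C & D"))
# FALSE  TRUE  TRUE FALSE
```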

setkey(DT, pattern)
for (i in 1:nrow(DT)) {
  # turn pattern[i] into a regex, select all other rows whose pattern matches it,
  # and take the min of the calculation over those rows (NA if there are none)
  DT[i, foo := DT[grepl(gsub("&", ".*", DT[i]$pattern, fixed = TRUE), pattern) &
                  DT[i]$pattern != pattern,
                  ifelse(.N == 0,
                         NA,
                         min((DT[i]$value1 - value1) /
                             (DT[i]$value1 / DT[i]$value2 - value1 / value2)))]]
}

Unfortunately the dataset is rather large and the for-loop is horribly slow on it. I hope somebody can help me with some data.table magic I don't know of to solve this issue. In essence my question is rather similar to this one, but here the string-formatted patterns are given, so I cannot use range joins.

Background: The patterns are output from association rule mining, for instance {onion & potatoes => burger}. There are thousands of different items (like A,B,C & D in the example). I try to add a statistical measure to find out how a rule relates to its sub-rules.

Roy van der Valk
  • How large is your data (rows, columns and size)? And what's the largest number of elements a pattern can have, i.e., A & B & C & D & E & F & ..? – Arun Dec 10 '14 at 09:37
  • I performed it on a sample set of around 600,908 rows, 12 columns (the pattern is in one column) and 181 MB. I want to perform it on a dataset which is at least 10 times as large. The patterns are actually item1=valueA & item2=valueB & ... - there are now 15 items which can have values ranging from 2 (binary) up to 1,000 categorical values. – Roy van der Valk Dec 10 '14 at 12:37

1 Answer


I don't understand what calculations you want to perform (I tried running your code and got Inf in two of the rows), but as a general idea, you could do this as an intermediate step:

DT[, hasA := grepl("A", pattern)]
DT[, hasB := grepl("B", pattern)]
DT[, hasC := grepl("C", pattern)]
DT[, hasD := grepl("D", pattern)]
DT[, foo_0 := value1*value2]

and go from there.
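As a rough sketch of how such indicator columns could be used (assuming the full item set is known up front, and sticking to the single-letter items from the example):

```r
library(data.table)
DT <- data.table(pattern = c("A", "A & B", "A & B & C", "A & C & D"),
                 value1 = c(1, 2, 3, 4),
                 value2 = c(5, 6, 7, 8))

items <- c("A", "B", "C", "D")            # assumed known in advance
ind <- sapply(items, function(it) grepl(it, DT$pattern, fixed = TRUE))

# shared[i, x] = number of items patterns i and x have in common;
# pattern i is a sub-pattern of x when they share all of i's items
shared <- ind %*% t(ind)
issub  <- shared == rowSums(ind) & outer(DT$pattern, DT$pattern, "!=")

DT[, foo := vapply(seq_len(.N), function(i) {
  x <- which(issub[i, ])                  # rows whose pattern contains pattern i
  if (!length(x)) return(NA_real_)
  min((value1[i] - value1[x]) /
      (value1[i] / value2[i] - value1[x] / value2[x]))
}, numeric(1))]
```

With real multi-character item names you would want to split `pattern` on `" & "` rather than `grepl` for substrings, so one item name cannot match inside another.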

Serban Tanasa
  • Thanks Serban. The different pattern combinations are in the thousands and the calculation is more complex than the given one, so I cannot solve it with your solution. – Roy van der Valk Dec 10 '14 at 09:06
  • I updated the question with more background and a fix for the Inf (these are in fact association rules without sub-rules, so now I return NA in these cases) – Roy van der Valk Dec 10 '14 at 09:08