I have some vectors like:

A1 = c("A","B","C")
A2 = c("A","B","C")
A3 = c("A","B",NA)
A4 = c(NA,"B","C")

Now I want something that gives me results like:

Pattern (A,B,C) occurs 2 times.
Pattern (A,B) occurs 3 times.
Pattern (B,C) occurs 3 times.

At the moment I take each vector and compare it with the others. This way I can find the (A,B,C) pattern, but not the (A,B) or (B,C) patterns.

Is there any package or mathematical model that can do this?

Edit 1: I cannot post the code due to confidentiality issues, but essentially I compared the first vector with the second, then with the third, and so on, using %in%. That gave me a matrix of TRUE/FALSE values. I then repeated the process for all vectors. Lastly, I looked for where TRUE has the highest density in the matrix.
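The original code was not posted, so the following is only a guess at what such a pairwise %in% comparison might look like for the first vector. The list name `vecs` and the matrix name `cmp` are made up for illustration.

```r
# Hypothetical reconstruction, not the asker's actual code
vecs <- list(
  A1 = c("A", "B", "C"),
  A2 = c("A", "B", "C"),
  A3 = c("A", "B", NA),
  A4 = c(NA, "B", "C")
)

# For each vector, flag which elements of A1 also appear in it.
# Rows are the elements of A1, columns are the vectors compared against.
cmp <- sapply(vecs, function(v) vecs$A1 %in% v)
rownames(cmp) <- vecs$A1
print(cmp)
```

Note that %in% tests membership only, so a matrix like this records which values are shared but not whether they sit next to each other.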

Edit 2: I know of the Apriori algorithm and the arules package, but Apriori is not very efficient.

Sim101011
  • Have you tried ngram? http://stackoverflow.com/questions/8161167/what-algorithm-i-need-to-find-n-grams – vrajs5 Mar 27 '15 at 06:19
  • I cannot post the code due to confidentiality issues, but essentially I compared the first vector with the second, then with the third, and so on, using %in%. That gave me a matrix of TRUE/FALSE values. I then repeated the process for all vectors. Lastly, I looked for where TRUE has the highest density in the matrix. – Sim101011 Mar 27 '15 at 06:21
  • I cannot use n-grams. – Sim101011 Mar 27 '15 at 06:32
  • Are these vectors always the same length, with values in the same positions and NA padding out the missing parts? – Spacedman Mar 27 '15 at 10:07
  • @Spacedman: yes, the vectors were made the same length by inserting NA. – Sim101011 Mar 27 '15 at 12:43

2 Answers

This could probably be shortened, but here is one approach:

A1 = c("A","B","C")
A2 = c("A","B","C")
A3 = c("A","B", NA)
A4 = c(NA,"B","C")

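# Collapse each vector into one string, using a space as a placeholder for NA
a <- lapply(list(A1, A2, A3, A4), function(x){
   x[is.na(x)] <- " "
   paste0(x, collapse="")
})

# The pattern to look for, collapsed the same way
pattern <- c("B", "C")
pattern_2 <- paste0(pattern, collapse="")

# Count how many vectors contain the pattern as a contiguous substring
sum(sapply(a, function(x){grepl(pattern_2, x, fixed=TRUE)}))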
dimitris_ps
  • Here you are giving the pattern as an input (pattern <- c("B", "C")), but that is what I want to find in the first place. – Sim101011 Mar 27 '15 at 06:50

An inefficient approach (a lot of loops), but it comes close to what you are looking for.

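# combn() ships with base R (package utils), so no extra package is loaded here

A1 = c("A","B","C")
A2 = c("A","B","C")
A3 = c("A","B", NA)
A4 = c(NA,"B","C")
df <- data.frame(A1, A2, A3, A4, stringsAsFactors = FALSE)
df[is.na(df)] <- " "

# All combinations of the unique values, for sizes 1 up to the vector length
a <- lapply(seq_len(nrow(df)), function(x) {utils::combn(unique(unlist(apply(df, 1, unique))), x)})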

pattern <- unlist(lapply(a, function(x){
  apply(x, 2, function(y){paste0(y, collapse="_")})
}))

a <- lapply(list(A1, A2, A3, A4), function(x){
  x[is.na(x)] <- " "
  paste0(x, collapse="_")
})

df2 <- sapply(a, function(x){sapply(pattern, function(z){grepl(z, x)})})
pattern <- rownames(df2)

# Number of vectors in which each candidate pattern occurs
occurs <- rowSums(df2)
pattern <- gsub(" ", "NA", pattern)
pattern <- gsub("_", ", ", pattern)

for(i in seq_along(pattern)){
  cat("Pattern (", pattern[i], ") occurs ", occurs[i], " times\n", sep = "")
}
dimitris_ps
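
A lighter alternative to enumerating every combination of values is to list only the contiguous runs that actually occur in each vector and then tabulate them, so the patterns are discovered rather than supplied. The following is a minimal sketch, not taken from either answer: the helper `contiguous_runs()` and its `min_len` argument are hypothetical names, and it assumes that a pattern means adjacent non-NA values and that each pattern is counted at most once per vector.

```r
# Hypothetical helper: list every contiguous run of at least `min_len`
# values found within the non-NA stretches of one vector.
contiguous_runs <- function(x, min_len = 2) {
  ok <- !is.na(x)
  # Give each stretch of consecutive NA / non-NA values its own id
  id <- cumsum(c(TRUE, diff(ok) != 0))
  stretches <- split(x[ok], id[ok])
  runs <- lapply(stretches, function(s) {
    n <- length(s)
    if (n < min_len) return(character(0))
    out <- character(0)
    for (len in min_len:n) {
      for (start in 1:(n - len + 1)) {
        out <- c(out, paste(s[start:(start + len - 1)], collapse = ","))
      }
    }
    out
  })
  # Count each pattern once per vector
  unique(as.character(unlist(runs, use.names = FALSE)))
}

vecs <- list(
  A1 = c("A", "B", "C"),
  A2 = c("A", "B", "C"),
  A3 = c("A", "B", NA),
  A4 = c(NA, "B", "C")
)

# Tabulate the runs across all vectors
tab <- table(unlist(lapply(vecs, contiguous_runs)))
counts <- sort(setNames(as.integer(tab), names(tab)), decreasing = TRUE)

for (p in names(counts)) {
  cat("Pattern (", p, ") occurs ", counts[[p]], " times.\n", sep = "")
}
```

For the four example vectors this reports (A,B) and (B,C) three times each and (A,B,C) twice. The number of runs grows quadratically with the length of each non-NA stretch, so this is only sensible for fairly short vectors.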