
I am searching within a vector similar to this:

x <- c("P1D3,P3A7", 0, 0, "P1D3,P3A7", "P1D3, P2A3, P4D2", 0, "P1D3, P3A7, P2G60", "P1D3,P3A7")

I am currently using grepl:

xPres <- grepl("P",x, ignore.case = FALSE)

Currently if I did

View(xPres)

I would see a vector like this

(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE)

However, I don't just want to flag every value other than 0. I want to check whether a value in the vector, or part of a value, matches another value, or part of another value, in the same vector.

The ideal result would produce something like this

(TRUE, FALSE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE)

The 5th value would change because it does not have any part that is matching, whereas everything else has some part of it that is matching some other value in the same vector, including the 7th value because a portion of it is matching some other value.

The only problem is that every non-zero value contains "P1D3", because it is present in all of the samples. Is there a way to solve this problem?

Edit: If I created a new vector with

x <- c("P1D3,P3A7", 0, 0, "P1D3,P3A7", "P1D3, P2A3, P4D2", 0, "P1D3, P3A7, P2G60", "P1D3,P3A7", "P1D3, P2A3, P4D2")

the code should produce

(TRUE, FALSE, FALSE, TRUE, TRUE, FALSE, TRUE, TRUE, TRUE)

It seems that finding multiple common substrings is the simplest way to go, but I do not know which package to download or what to use.

Darwin Chang
  • I don't understand why 5 (`"P1D3, P2A3, P4D2"`) is FALSE whereas 4 (`"P1D3,P3A7"`) is TRUE. Neither of them is 0. Are you looking for the longest common substring, as in [here](https://stackoverflow.com/questions/1429476/r-longest-common-substring)? – akrun Aug 06 '18 at 03:48
  • I said above that I don't just want to look for anything other than 0, I want to be able to match values. In my ideal result "P1D3, P2A3, P4D2" would be FALSE, I just need to get there. I suppose the longest common substring would help, but I would need multiple long common substrings. – Darwin Chang Aug 06 '18 at 03:58
  • Your question is unclear – Onyambu Aug 06 '18 at 04:09

1 Answer


Using your original x, this seems to give the answer you want:

x <- c("P1D3,P3A7", 0, 0, "P1D3,P3A7", "P1D3, P2A3, P4D2", 0, "P1D3, P3A7, P2G60", "P1D3,P3A7")

Then solve:

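# Split on commas, then drop the ubiquitous "P1D3" and the "0" placeholder
spl <- lapply(strsplit(x, ",\\s*"), setdiff, y = c("P1D3", "0"))
# TRUE where any remaining value also occurs in some other element
mapply(function(v, s) any(v %in% unlist(spl[-s])), spl, seq_along(spl))
#[1]  TRUE FALSE FALSE  TRUE FALSE FALSE  TRUE  TRUE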

First I split each element on the commas and remove the common "P1D3" and "0" values.
Then I loop over spl to check whether any value in the current set is present anywhere else in spl. "Anywhere else" is represented by spl[-s], which returns spl without the set currently being processed.
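
If hard-coding "P1D3" is undesirable, the same idea can be sketched with the shared value derived from the data. This is only a sketch under assumptions: the helper name `shares_token` and its `sep` and `ignore` arguments are hypothetical and not part of the answer above, and it assumes that "present in all of the samples" means present in every element that is not just "0".

```r
# Hypothetical helper (not from the original answer): flags elements that
# share at least one value with some other element, ignoring "0" and any
# value that occurs in every non-zero element.
shares_token <- function(x, sep = ",\\s*", ignore = "0") {
  tokens <- strsplit(as.character(x), sep)
  # Keep only elements that are not made up purely of ignored values
  nonzero <- Filter(function(tk) !all(tk %in% ignore), tokens)
  # Values common to every non-zero element (e.g. "P1D3" in the question)
  universal <- if (length(nonzero) > 0) Reduce(intersect, nonzero) else character(0)
  spl <- lapply(tokens, setdiff, y = c(universal, ignore))
  # For each element, test its remaining values against all other elements
  vapply(seq_along(spl),
         function(i) any(spl[[i]] %in% unlist(spl[-i])),
         logical(1))
}

x <- c("P1D3,P3A7", 0, 0, "P1D3,P3A7", "P1D3, P2A3, P4D2", 0,
       "P1D3, P3A7, P2G60", "P1D3,P3A7")
shares_token(x)
# TRUE FALSE FALSE TRUE FALSE FALSE TRUE TRUE

# The extended vector from the question's edit: element 5 now has a partner
x2 <- c(x, "P1D3, P2A3, P4D2")
shares_token(x2)
# TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE TRUE
```

Deriving the shared value with `Reduce(intersect, ...)` keeps the behaviour of the answer for this data, but it would also strip any other value that happens to appear in every non-zero element, which may or may not be what is wanted.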

thelatemail
  • Currently the vector I am using is a column of a data frame. For the function strsplit(x, ",\\s*"), would the part in quotation marks still apply? It reports a non-character argument – Darwin Chang Aug 07 '18 at 15:27
  • @DarwinChang - pretty sure that's because you have a `factor` column - try `spl <- lapply(strsplit(as.character(DF$yourcolumn), ",\\s*"), setdiff, y=c("P1D3","0"))` and that should fix it. – thelatemail Aug 07 '18 at 21:57
  • It seems to work for the most part. The only issue now is with a vector such as x <- c("P1D3,P3A7,P2G60", 0, 0, "P1D3,P3A7", "P1D3, P2A3, P4D2", 0, "P1D3, P3A7, P2G60", "P1D3,P3A7,P2G60"): the 4th value "P1D3,P3A7" returns FALSE because it is shorter than the majority. Is there a way to make that value TRUE? – Darwin Chang Aug 08 '18 at 00:09
  • @DarwinChang - like `mapply(function(v,s) any(v %in% unlist(spl[-s])), spl, seq_along(spl))` or something? – thelatemail Aug 08 '18 at 00:20
  • This is great! I think it works! Can you explain what the unlist does? – Darwin Chang Aug 08 '18 at 00:30
  • @DarwinChang It collapses all the values in the `spl` list to one long vector so that they can be compared to the current list component. – thelatemail Aug 08 '18 at 00:32
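
The two points raised in these comments (a factor column and the role of unlist) can be seen together in a small sketch. The data frame `DF`, its column name `samples`, and the result column `shared` are hypothetical names chosen for illustration; `stringsAsFactors = TRUE` is set only to reproduce a factor column.

```r
# Hypothetical data frame; the column name `samples` is an assumption.
DF <- data.frame(
  samples = c("P1D3,P3A7", "0", "P1D3, P2A3, P4D2", "P1D3, P3A7, P2G60"),
  stringsAsFactors = TRUE  # forces a factor column, as in the comment above
)

# strsplit() rejects factors ("non-character argument"), so convert first
spl <- lapply(strsplit(as.character(DF$samples), ",\\s*"),
              setdiff, y = c("P1D3", "0"))

# spl[-1] is still a list; unlist() flattens it into one character vector
# that %in% can test against
others <- unlist(spl[-1])
others
# "P2A3" "P4D2" "P3A7" "P2G60"

# Same test as in the answer, stored back on the data frame
DF$shared <- mapply(function(v, s) any(v %in% unlist(spl[-s])),
                    spl, seq_along(spl))
DF$shared
# TRUE FALSE FALSE TRUE
```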