
My question is the following. Suppose I have this data frame:

df<- data.frame(Respuestas=c("sí, acepto, a veces, no acepto",
                            "no acepto, sí, acepto, a veces, nunca",
                            "a veces, sí, acepto, nunca, bla bla"))
print(df)

                             Respuestas
1        sí, acepto, a veces, no acepto
2 no acepto, sí, acepto, a veces, nunca
3   a veces, sí, acepto, nunca, bla bla

I needed to extract, into separate columns, every string in the "Respuestas" column that matches an entry in a dictionary, so I applied @divibisan's solution. So far so good; I get the expected output:

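library(stringr)

# Dictionary of accepted answers
vec <- c("sí, acepto", "a veces", "no acepto")

# For each row, extract every dictionary entry found in Respuestas
t(apply(df, 1, function(x)
  str_extract_all(x[['Respuestas']], vec, simplify = TRUE)))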

     [,1]         [,2]      [,3]
[1,] "sí, acepto" "a veces" "no acepto"
[2,] "sí, acepto" "a veces" "no acepto"
[3,] "sí, acepto" "a veces" ""

What I ultimately want, though, is a data frame with the values in the "Respuestas" column that do not match any entry in the vec dictionary, something like this:

wishDF<- data.frame(noMatch1=c(NA,
                    "nunca",
                    "nunca"),
                    noMatch2= c(NA,NA, "bla bla"))
print(wishDF)

    noMatch1 noMatch2
1     <NA>     <NA>
2    nunca     <NA>
3    nunca  bla bla

I tried using str_detect and invert_match from the stringr package in the same way as @divibisan's solution, but I don't get a good result. What would you recommend?

Thank you very much!

Tho Vu

1 Answer


The easiest way to find the content of a string that doesn't match is simply to remove the content that does match:

str_remove(string, pattern)

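But this function is vectorized over pattern: given one string and the three entries of vec, it returns three strings, each with only one entry removed. We need to look at the implementation: str_remove is an alias for str_replace(string, pattern, ""), which is built on the stringi package. Calling stringi directly with vectorize_all = FALSE applies every pattern in turn to the same string: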

stringi::stri_replace_all_coll(string, pattern, "", vectorize_all = FALSE)
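To make the difference concrete, here is a minimal sketch using the second row of the question's data. The variable names x, one_at_a_time and all_removed are only illustrative, and it assumes both stringr and stringi are installed.

```r
library(stringr)
library(stringi)

vec <- c("sí, acepto", "a veces", "no acepto")
x   <- "no acepto, sí, acepto, a veces, nunca"

# Vectorized over pattern: x is recycled against the three patterns,
# so we get three strings, each with a single entry removed
one_at_a_time <- str_remove(x, vec)

# vectorize_all = FALSE: every pattern is applied in turn to the same string,
# so we get one string with all dictionary entries removed
all_removed <- stri_replace_all_coll(x, vec, "", vectorize_all = FALSE)

print(one_at_a_time)
print(all_removed)
```

The first call returns a vector of length three, the second a single string in which only the commas and "nunca" are left.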

Finally, we want to do that for every row in Respuestas, which is simple with map_chr from purrr:

map_chr(df$Respuestas,
    ~ stringi::stri_replace_all_coll(.x, vec, "", vectorize_all = FALSE))
# [1] ", , "               ", , , nunca"        ", , nunca, bla bla"

With regard to noMatch1 and noMatch2, it is possible to split the result on ",". But I don't know enough about your data to be sure it will work: do you always have the same number of fields? How do you distinguish between the comma inside "sí, acepto" and the one between "sí, acepto" and "nunca"?

Depending on your data, something like this may or may not work (and may or may not make any sense at all):

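library(dplyr)
library(purrr)
library(tidyr)

df %>%
  # Remove every dictionary entry, keeping only the leftover text
  mutate(no_match = map_chr(Respuestas,
                            ~ stringi::stri_replace_all_coll(.x, vec, "", vectorize_all = FALSE))) %>%
  # Split the leftover text on commas into up to four columns
  separate(col = no_match,
           into = c("first", "second", "third", "fourth"),
           sep = ",",
           extra = "merge",
           fill = "left")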
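If the goal is exactly the wishDF layout from the question, one possible sketch is to remove the dictionary entries first (which also takes care of the comma inside "sí, acepto"), then split what is left on commas, drop the empty pieces and pad each row with NA. It assumes the leftover answers themselves never contain a comma and that no dictionary entry appears as part of a longer answer. The names leftover, pieces, padded and no_match are only illustrative.

```r
library(stringi)

df <- data.frame(Respuestas = c("sí, acepto, a veces, no acepto",
                                "no acepto, sí, acepto, a veces, nunca",
                                "a veces, sí, acepto, nunca, bla bla"),
                 stringsAsFactors = FALSE)
vec <- c("sí, acepto", "a veces", "no acepto")

# 1. Remove every dictionary entry from each answer string
leftover <- vapply(df$Respuestas, function(x)
  stri_replace_all_coll(x, vec, "", vectorize_all = FALSE),
  character(1), USE.NAMES = FALSE)

# 2. Split on commas, trim whitespace, and drop the empty pieces
pieces <- lapply(strsplit(leftover, ",", fixed = TRUE), function(p) {
  p <- trimws(p)
  p[nzchar(p)]
})

# 3. Pad every row with NA up to the longest row (at least one column)
width  <- max(1L, lengths(pieces))
padded <- lapply(pieces, function(p) c(p, rep(NA_character_, width - length(p))))

# 4. Bind the rows into a data frame with noMatch1, noMatch2, ... columns
no_match <- as.data.frame(do.call(rbind, padded), stringsAsFactors = FALSE)
names(no_match) <- paste0("noMatch", seq_len(width))

print(no_match)
```

Because the removal is plain substring replacement, an answer such as "a veces no" would leave "no" behind, so this only makes sense if the answers are a clean comma-separated list.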
Alexlok