str_detect exclusion of word combinations

Question

I am doing a content analysis of French politicians' Twitter posts dealing with immigration. As I only recently started working with strings, I am having trouble excluding certain word combinations. Specifically, I defined the word "identité" (or words with the same stem) as an indicator that a tweet deals with immigration. However, the word combination "carte d'identité" (ID card) is never actually used in this context, so I would like to exclude it.

The original code looks like this:

 mutate(identit = str_detect(full_text, "identit"))

So far, I have tried to exclude it with the hat operator (^) inside square brackets:

  mutate(identit = str_detect(full_text, "[^carte d']identité"))

However, this actually includes it, along with forms preceded by an article such as l'immigration and d'immigration, whereas forms without an article, such as identité or identitaire, are excluded.

Edit: to make it reproducible:

library(dplyr)
library(stringr)
df <- data.frame(text = c("Ma carte d\'identité","Notre identité", "ce n'est pas l'identité du pays", "d'identité", "tasty buns"))
df_detect <- df %>% mutate(identit = str_detect(text, "*???*"))

(Basically, in this data frame I'd like str_detect to detect only "Notre identité", "ce n'est pas l'identité du pays", and "d'identité".)

Welcome to SO! What is your desired output, and what does your dataset look like? Please read [this](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). Without any further information, though, something like this could work: `df <- data.frame(phrases = c('aDASD carte d\'identité','FSDFSDG identité')); df[!grepl("carte d\'identité", df$phrases),]` - s__, Mar 10 '23 at 14:38
This answer may help: [https://stackoverflow.com/questions/2078915/a-regular-expression-to-exclude-a-word-string](https://stackoverflow.com/questions/2078915/a-regular-expression-to-exclude-a-word-string) - mfg3z0, Mar 10 '23 at 14:47
Hi @s__, thank you! I hope I got you right and have added a reproducible data frame (plus further description). - Samnang, Mar 10 '23 at 15:25

score 0 · Answer 1 · answered Mar 10 '23 at 16:02

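You were on the right track with the exclusion [^carte d'], but square brackets define a character class: the pattern treats c, a, r, t, e, the space, d and ' as separate individual characters and requires that the single character preceding your target is none of them. What you want is a lookaround, in particular the negative lookbehind ("not preceded by"), (?<!...):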

str_detect(text, "(?<!carte d\')identité")
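To see this on the example data from the question, here is a minimal sketch. It assumes dplyr and stringr are installed and that the tweets use the straight apostrophe ('), as in the example data frame. The second column, identit_stem, is only an illustration of my own: it applies the same lookbehind to the stem "identit" from the original code, so that forms like identitaire are also caught.

```r
library(dplyr)
library(stringr)

# Example data frame from the question
df <- data.frame(text = c("Ma carte d'identité",
                          "Notre identité",
                          "ce n'est pas l'identité du pays",
                          "d'identité",
                          "tasty buns"))

df_detect <- df %>%
  mutate(
    # "identité" that is NOT immediately preceded by "carte d'"
    identit = str_detect(text, "(?<!carte d')identité"),
    # Illustrative variant (assumption): same lookbehind on the stem only
    identit_stem = str_detect(text, "(?<!carte d')identit")
  )

df_detect$identit
# FALSE  TRUE  TRUE  TRUE FALSE
```

Only "Ma carte d'identité" is rejected, because the lookbehind checks the whole sequence "carte d'" directly in front of the match rather than one character at a time.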

In case you don't have it already, the stringr cheatsheet has a lot of good information about regular expressions. I personally refer to it no less than three times a day when I'm working with text data.
