Consider the following reproducible dataset, which I created based on the Donald Trump-Tweets dataset (which can be found here):
library(tidyverse)  # tibble(), mutate(), case_when(), and the stringr functions

df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))
It consists of two columns, the target group of the tweets and the tweet_id:
# A tibble: 10 x 2
   target              tweet_id
   <chr>                  <dbl>
 1 jeb-bush                   1
 2 jeb-bush                   2
 3 jeb-bush-supporters        3
 4 jeb-bush-supporters        4
 5 jeb-staffer                5
 6 the-media                  6
 7 the-media                  7
 8 the-media                  8
 9 the-media                  9
10 the-media                 10
Goal:
Whenever an element in target starts with "jeb", I want to extract the string after the "-". Whenever an element that starts with "jeb" contains multiple "-", I want to extract the string after the LAST "-" (in this example dataset, that is only the case for jeb-bush-supporters). For every element that doesn't start with "jeb", I just want to return the string "other".
In the end, it should look like this:
# A tibble: 10 x 3
   target              tweet_id new_var
   <chr>                  <dbl> <chr>
 1 jeb-bush                   1 bush
 2 jeb-bush                   2 bush
 3 jeb-bush-supporters        3 supporters
 4 jeb-bush-supporters        4 supporters
 5 jeb-staffer                5 staffer
 6 the-media                  6 other
 7 the-media                  7 other
 8 the-media                  8 other
 9 the-media                  9 other
10 the-media                 10 other
What I have tried:
I have actually managed to create the desired output with the following code:
df %>%
  mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                               str_extract(target, "(?<=[a-z]{3}-)[a-z]+"),
                             str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                               str_extract(target, "(?<=[a-z]{3}-[a-z]{4}-)[a-z]+"),
                             TRUE ~ "other"))
But the problem is this:
In the second str_extract statement, I have to specify the exact number of letters in the positive look-behind ([a-z]{4}). Otherwise R complains about needing a "bounded maximum length". But what if I don't know the exact pattern length, or if it varies from element to element?
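To make the restriction concrete, here is a minimal sketch that reproduces the complaint on a single string. The pattern is a hypothetical one chosen only for illustration (it is not the exact pattern I used above); the point is that a "+" inside the look-behind leaves its length unbounded, which, as far as I understand, the ICU regex engine behind stringr does not accept.

```r
library(stringr)

# Hypothetical pattern for illustration: the "+" inside the look-behind
# means the look-behind has no maximum length.
unbounded <- "(?<=jeb-[a-z]+-)[a-z]+"

# Capture the error message instead of stopping the script.
lookbehind_error <- tryCatch(
  str_extract("jeb-bush-supporters", unbounded),
  error = function(e) conditionMessage(e)
)

print(lookbehind_error)
```

Replacing "[a-z]+" with a fixed or bounded quantifier such as "[a-z]{4}" or "[a-z]{1,20}" avoids the error, but only if an upper bound is known in advance.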
Alternatively, I tried to work with capture groups instead of look-arounds. To do so, I used str_match to define what I WANT to extract instead of what I DON'T want to extract:
df %>%
  mutate(new_var = case_when(str_detect(target, "^jeb-[a-z]+$") ~
                               str_match(target, "jeb-([a-z]+)"),
                             str_detect(target, "^jeb-[a-z]+-[a-z]+") ~
                               str_match(target, "jeb-[a-z]+-([a-z]+)"),
                             TRUE ~ "other"))
But then I receive this error message:
Error: Problem with `mutate()` input `new_var`.
x `str_detect(target, "^jeb-[a-z]+$") ~ str_match(target, "jeb-([a-z]+)")`, `str_detect(target, "^jeb-[a-z]+-[a-z]+") ~ str_match(target,
"jeb-[a-z]{4}-([a-z]+)")` must be length 10 or one, not 20.
i Input `new_var` is `case_when(...)`.
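The "length 20" in the message can be checked directly. The sketch below rebuilds the example data and inspects what str_match returns for one of the patterns: a character matrix with one row per input and one column for the full match plus one per capture group. The names m and last_part are hypothetical helpers introduced only for this check.

```r
library(stringr)
library(tibble)

df <- tibble(target = c(rep("jeb-bush", 2), rep("jeb-bush-supporters", 2),
                        "jeb-staffer", rep("the-media", 5)),
             tweet_id = seq(1, 10, 1))

# str_match() returns a character matrix, not a vector.
m <- str_match(df$target, "jeb-[a-z]+-([a-z]+)")
dim(m)     # rows = inputs, columns = full match + one capture group
length(m)  # total number of cells in the matrix

# Selecting the second column keeps only the capture group,
# one value per input (NA where the pattern does not match).
last_part <- m[, 2]
length(last_part)
```

Under this reading, the matrix has 10 rows and 2 columns, which would explain why case_when sees 20 elements, while the single column m[, 2] has the same length as the original column.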
Question:
Ultimately, I want to know whether there is a concise way to extract specific string patterns inside a case_when statement. How can I work around the problem described here when I can use neither look-arounds (because I can't define a bounded maximum length) nor capture groups (because str_match would return an object of length 20 instead of the original size 10 or one)?