Find in R elements in same text vector that contain two substrings

Question

I have a text vector with five elements named text2. It is a sample of an actual dataset with over 1,800 rows and multiple columns.

I have reviewed other code solutions on Stack Overflow and could not find a match.

Input

text2 <- c("Ian Desmond hits an inside-the-park home run (8) on a line drive down the right-field line. Brendan Rodgers scores. Tony Wolters scores." , "Ian Desmond lines out sharply to center fielder Jason Heyward.", "Ian Desmond hits a grand slam (9) to right center field. Charlie Blackmon scores. Trevor Story scores. David Dahl scores.", "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.", "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

Output

I want to know which elements in text2 contain both "Wild pitch" and "scores." I would like both the count and the element numbers. For example, in text2 only one element (the last one) is a match. Thus, the output would contain both the count (1) and the element number (5).

Code tried

str_detect(text2, ("Wild pitch|scores"))

score 2 · Accepted Answer · answered Jul 04 '20 at 13:54

You're on the right track. However, str_detect(text2, ("Wild pitch|scores")) tells you whether each element of text2 contains "Wild pitch" OR "scores". Combining two separate str_detect() calls with & gives you your desired output:

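library(stringr)  # provides str_detect()

# TRUE where an element contains both substrings
ind <- str_detect(text2, "Wild pitch") & str_detect(text2, "scores")
count <- sum(ind)  # number of matching elements
count
# 1
pos <- which(ind)  # positions of matching elements
pos
# 5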
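To make the OR versus AND difference concrete, here is a minimal base R sketch run on the question's sample vector. Using grepl() with fixed = TRUE is a swapped-in alternative to str_detect(), not the method from the answer above; it treats the search strings as literal text, so the period in "scores." is not read as a regex wildcard. The names either and both are illustrative only.

```r
# The five sample strings from the question.
text2 <- c("Ian Desmond hits an inside-the-park home run (8) on a line drive down the right-field line. Brendan Rodgers scores. Tony Wolters scores.",
           "Ian Desmond lines out sharply to center fielder Jason Heyward.",
           "Ian Desmond hits a grand slam (9) to right center field. Charlie Blackmon scores. Trevor Story scores. David Dahl scores.",
           "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.",
           "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

# OR: the alternation matches an element that contains either substring.
either <- grepl("Wild pitch|scores", text2)
which(either)
# [1] 1 3 4 5

# AND: one literal test per substring, combined element-wise with &.
both <- grepl("Wild pitch", text2, fixed = TRUE) &
  grepl("scores.", text2, fixed = TRUE)
sum(both)
# [1] 1
which(both)
# [1] 5
```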

Maël · Answer 2 · 2020-07-04T14:38:18.160

1

A dplyr solution in a single pipeline:

library(dplyr)
library(tidyr)
library(stringr)  # str_detect() comes from stringr

text2 %>% 
  as_tibble() %>% 
  mutate(WP = str_detect(text2,"Wild pitch"),
         S = str_detect(text2,"scores")) %>% 
  summarise(count=sum(WP==T & S==T),
            position=list(which(WP==T & S==T))) %>% 
  unnest(cols=c(position))

Which gives:

# A tibble: 1 x 2
  count position
  <int>    <int>
1     1        5

edited Jul 04 '20 at 14:38

answered Jul 04 '20 at 14:22


when I ran your code I got this error: Error in unnest(., cols = c(position)) : could not find function "unnest" – Metsfan Jul 04 '20 at 14:36
It's because it's in the `tidyr` package, I edited my post ;) – Maël Jul 04 '20 at 14:38
In `summarise(count=sum(WP==T & S==T)` where are the values of 'T' coming from? What are WP and S being made equal to? – Metsfan Jul 04 '20 at 17:05
T means TRUE, so that `count=sum(WP==T & S==T)` sums the number of elements when a sentence has `"Wild pitch"` and `"scores"` in it. – Maël Jul 05 '20 at 22:00

score 0 · Answer 3 · answered Jul 04 '20 at 14:27

You can use the pattern:

pattern <- 'Wild pitch.*scores|scores.*Wild pitch'

To find the positions, you can use grep:

grep(pattern, text2)
#[1] 5

For the count, you can take the length of the grep result:

length(grep(pattern, text2))
#[1] 1
# Can also use grepl with sum:
# sum(grepl(pattern, text2))
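The alternation pattern has to spell out every ordering of the substrings, which gets unwieldy beyond two of them. As a rough sketch of one way around that, the hypothetical helper contains_all() below (not part of any package) tests each substring separately as literal text and combines the results with &. The vector x is made-up sample data.

```r
# Hypothetical helper: TRUE for each element of x that contains every
# string in `substrings`, in any order. Substrings are matched literally.
contains_all <- function(x, substrings) {
  hits <- lapply(substrings, function(s) grepl(s, x, fixed = TRUE))
  Reduce(`&`, hits)
}

# Made-up sample data for illustration.
x <- c("Wild pitch by A. B scores.",
       "B scores on a single.",
       "Wild pitch by A. No advance.",
       "B scores. Wild pitch by A.")

found <- contains_all(x, c("Wild pitch", "scores"))
which(found)
# [1] 1 4
sum(found)
# [1] 2
```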

Chris Ruehlemann · Answer 4 · 2020-07-04T15:48:45.320

0

A one-liner solution with positive lookahead:

res <- c(length(grep("(?=Wild pitch).*scores", text2, perl = TRUE)), 
         grep("(?=Wild pitch).*scores", text2, perl = TRUE))

res
# [1] 1 5
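Note that this pattern is order-sensitive: the lookahead fixes the position where "Wild pitch" starts, and ".*scores" must then match from that position onward, so "scores" has to appear after "Wild pitch". A small check on made-up strings (x and ordered are illustrative names):

```r
# Made-up sample data: the same two substrings in both orders.
x <- c("Wild pitch by A. B scores.",
       "B scores. Wild pitch by A.")

# perl = TRUE is required for lookahead syntax in base R regex functions.
ordered <- grepl("(?=Wild pitch).*scores", x, perl = TRUE)
ordered
# [1]  TRUE FALSE
```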

If the order in which "Wild pitch" and "scores" occur can vary, use this pattern instead, which asserts each substring separately from the start of the string:

"^(?=.*Wild pitch)(?=.*scores)"

edited Jul 04 '20 at 15:48

answered Jul 04 '20 at 14:59

