0

I have a character vector named text2 with five elements. It is a sample from an actual dataset with over 1,800 rows and multiple columns.

I have reviewed other solutions on Stack Overflow and could not find a match.

Input

text2 <- c("Ian Desmond hits an inside-the-park home run (8) on a line drive down the right-field line. Brendan Rodgers scores. Tony Wolters scores." , "Ian Desmond lines out sharply to center fielder Jason Heyward.", "Ian Desmond hits a grand slam (9) to right center field. Charlie Blackmon scores. Trevor Story scores. David Dahl scores.", "Ian Desmond homers (12) on a fly ball to center field. Daniel Murphy scores.", "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

Output

I want to know which elements of text2 contain both "Wild pitch" and "scores." I would like both the count and the element numbers. For example, only one element of text2 (the last one) is a match, so the output would contain the count (1) and the element number (5).

Code tried

str_detect(text2, ("Wild pitch|scores"))

kath
Metsfan

4 Answers

2

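You're on the right track. However, str_detect(text2, ("Wild pitch|scores")) tells you whether each element of text2 contains "Wild pitch" OR "scores", because | means alternation in a regular expression. To require both, run the two checks separately and combine them with &. This gives you your desired output: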

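library(stringr)

# TRUE for each element that contains both phrases
ind <- str_detect(text2, "Wild pitch") & str_detect(text2, "scores")
count <- sum(ind)   # number of matching elements
count
# 1
pos <- which(ind)   # positions of the matching elements
pos
# 5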
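The same idea can be sketched in base R without stringr. This is only an illustration: the vector `plays` is an abbreviated copy of the question's data, and `fixed = TRUE` is used so that the period in "scores." is matched literally instead of as a regex wildcard.

```r
# Abbreviated sample of the question's data (assumption: shortened for readability)
plays <- c("Ian Desmond hits an inside-the-park home run (8). Tony Wolters scores.",
           "Ian Desmond lines out sharply to center fielder Jason Heyward.",
           "Ian Desmond hits a grand slam (9). David Dahl scores.",
           "Ian Desmond homers (12). Daniel Murphy scores.",
           "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

# fixed = TRUE treats each pattern as a literal string, so "." is a real period
both <- grepl("Wild pitch", plays, fixed = TRUE) &
        grepl("scores.", plays, fixed = TRUE)

count <- sum(both)    # number of elements containing both phrases
pos   <- which(both)  # positions of those elements
```

For a data frame column, the same logical vector can be computed from the column and stored alongside it.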
kath
1

A dplyr pipeline solution (it also needs tidyr for unnest() and stringr for str_detect()):

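library(dplyr)
library(tidyr)    # for unnest()
library(stringr)  # for str_detect()

text2 %>% 
  as_tibble() %>% 
  # one logical column per phrase
  mutate(WP = str_detect(text2,"Wild pitch"),
         S = str_detect(text2,"scores")) %>% 
  # T is shorthand for TRUE; count and locate rows where both flags are set
  summarise(count=sum(WP==T & S==T),
            position=list(which(WP==T & S==T))) %>% 
  unnest(cols=c(position))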

Which gives:

# A tibble: 1 x 2
  count position
  <int>    <int>
1     1        5
Maël
  • when I ran your code I got this error: Error in unnest(., cols = c(position)) : could not find function "unnest" – Metsfan Jul 04 '20 at 14:36
  • It's because it's in the `tidyr` package, i edited my post ;) – Maël Jul 04 '20 at 14:38
  • In `summarise(count=sum(WP==T & S==T)` where are the values of 'T' coming from? What are WP and S being made equal to? – Metsfan Jul 04 '20 at 17:05
  • T means TRUE, so that `count=sum(WP==T & S==T)` sums the number of elements when a sentence has `"Wild pitch"` and `"scores"` in it. – Maël Jul 05 '20 at 22:00
0

You can use this pattern, which matches the two phrases in either order:

pattern <- 'Wild pitch.*scores|scores.*Wild pitch'

To find position, you can use grep

grep(pattern, text2)
#[1] 5

For the count, take the length of the grep result:

length(grep(pattern, text2))
#Can also use grepl with sum
#sum(grepl(pattern, text2))
#[1] 1
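If more than two phrases ever need to be required at once, writing out every ordering in one pattern gets unwieldy. A minimal sketch of an alternative, assuming literal (non-regex) phrases: the helper name `contains_all` is hypothetical, and the sample vector is an abbreviated copy of the question's data.

```r
# Hypothetical helper: TRUE where an element of x contains every phrase
contains_all <- function(x, phrases) {
  # one logical vector per phrase, matched literally
  hits <- lapply(phrases, function(p) grepl(p, x, fixed = TRUE))
  # element-wise AND across all phrases
  Reduce(`&`, hits)
}

# Abbreviated sample of the question's data (assumption: shortened for readability)
plays <- c("Ian Desmond homers (12). Daniel Murphy scores.",
           "Ian Desmond lines out sharply to center fielder Jason Heyward.",
           "Wild pitch by pitcher Jake Faria. Sam Hilliard scores.")

matched <- contains_all(plays, c("Wild pitch", "scores"))
which(matched)  # positions
sum(matched)    # count
```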
Ronak Shah
0

A one-liner solution with positive lookahead:

res <- c(length(grep("(?=Wild pitch).*scores", text2, perl = TRUE)),
         grep("(?=Wild pitch).*scores", text2, perl = TRUE))

res
# [1] 1 5

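The pattern above only matches when "Wild pitch" comes before "scores". If the order of the two phrases can vary, use this pattern instead, which anchors one lookahead per phrase at the start of the string:

"^(?=.*Wild pitch)(?=.*scores)"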
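One way to check an order-independent lookahead pattern is to run it on strings that have the phrases in both orders. The sketch below assumes the pattern `^(?=.*Wild pitch)(?=.*scores)`; the second test string is a hypothetical reordering made up for the check, not part of the dataset.

```r
# Each lookahead starts at the beginning of the string and scans ahead for its phrase
pattern <- "^(?=.*Wild pitch)(?=.*scores)"

checks <- c("Wild pitch by pitcher Jake Faria. Sam Hilliard scores.",  # from the question
            "Sam Hilliard scores. Wild pitch by pitcher Jake Faria.",  # hypothetical reversed order
            "Ian Desmond homers (12). Daniel Murphy scores.")          # only one phrase present

# Lookaheads need the PCRE engine, hence perl = TRUE
hits <- grepl(pattern, checks, perl = TRUE)
res  <- c(sum(hits), which(hits))  # count first, then positions
```

A string that contains only one of the two phrases fails one of the lookaheads and is not matched.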
Chris Ruehlemann