Regex to match sentences with adjacent and non-adjacent word repetition in R

Question

I have a dataframe with sentences; in some sentences, words get used more than once:

df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
                          "it 's like being in a play-group , in n it ?",
                          "oh is that that steak i got the other night ?",
                          "well where have the middle sized soda stream bottle gone ?",
                          "this is a half day , right ? needs a full day",
                          "yourself , everybody 'd be changing your hair in n it ?",
                          "cos he finishes at four o'clock on that day anyway .",
                          "no no no i 'm dave and you 're alan .",
                          "yeah , i mean the the film was quite long though",
                          "it had steve martin in it , it 's a comedy",
                          "oh it is a dreary old day in n it ?",
                          "no it 's not mother theresa , it 's saint theresa .",
                          "oh have you seen that face lift job he wants ?",
                          "yeah bolshoi 's right so which one is it then ?"))

I want to match those sentences in which a word, any word, is repeated one or more times.

EDIT 1:

The repeated words **can** be adjacent, but they need not be. That is why Regular Expression For Consecutive Duplicate Words does not answer my question.

I've been modestly successful with this code:

df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?      
[2] it 's like being in a play-group , in n it ?           
[3] oh is that that steak i got the other night ?          
[4] this is a half day , right ? needs a full day          
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .                  
[7] yeah , i mean the the film was quite long though       
[8] it had steve martin in it , it 's a comedy             
[9] oh it is a dreary old day in n it ?

The success is only modest because some sentences are matched that should not be, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others that should be matched are not, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?

Expected result:

df
                                                         Turn
2                it 's like being in a play-group , in n it ?
3               oh is that that steak i got the other night ?
5               this is a half day , right ? needs a full day
8                       no no no i 'm dave and you 're alan .
9            yeah , i mean the the film was quite long though
10                 it had steve martin in it , it 's a comedy
11                        oh it is a dreary old day in n it ?
12        no it 's not mother theresa , it 's saint theresa .

EDIT 2:

Another question is how to specify the exact number of repetitions. The imperfect regex above matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for a triple occurrence of a word, I get this code and this result:

df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan .         # "no" occurs 3 times

But again the match is imperfect as the expected result would be:

[1] no no no i 'm dave and you 're alan .          # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy     # "it" occurs 3 times

Any help is much appreciated!

@WiktorStribiżew `df[grepl("\\b(\\w+)\\s+\\1\\b", df$Turn),]` matches only sentences in which the repeated words are adjacent! It does not match sentences where the repeated items are separated by other words. — Chris Ruehlemann, Feb 28 '20 at 09:56
@WiktorStribiżew `df[grepl("\\b(\\w+)\\b.*\\b\\1\\b", df$Turn),]` matches **some** sentences with non-adjacent repetitions but fails to match others that should be matched! — Chris Ruehlemann, Feb 28 '20 at 10:01
If you do not explain the actual rules and do not provide an expected output, we can't help you more. — Wiktor Stribiżew, Feb 28 '20 at 10:02
Ehm, `df[grepl("\\b(\\w+)\\b.*\\b\\1\\b", df$Turn, perl=TRUE),]` works. You just need to tell R to use the PCRE regex engine. See [the R demo](https://rextester.com/OFVF51113). Actually, `df[grepl("\\b(\\w+)\\b.*?\\b\\1\\b", df$Turn, perl=TRUE),]` can also be used; it makes no difference if you do not know whether the repeated words appear close to one another or one is close to the start and the other close to the end of the string. — Wiktor Stribiżew, Feb 28 '20 at 10:19
@WiktorStribiżew Yes, `df[grepl("\\b(\\w+)\\b.*\\b\\1\\b", df$Turn, perl=TRUE),]` works, thank you. What difference does the inclusion of `perl = T` make? — Chris Ruehlemann, Feb 28 '20 at 10:21
It has already been reported that using capturing groups and backreferences to them alongside indefinitely quantified patterns in TRE patterns yields unexpected results. — Wiktor Stribiżew, Feb 28 '20 at 10:24
@WiktorStribiżew How would the code have to be changed if I want to match just sentences with a *certain number* of repetitions? — Chris Ruehlemann, Feb 28 '20 at 10:26
Are you changing the question now? It depends on what number: no more than, no fewer than or exactly N times. — Wiktor Stribiżew, Feb 28 '20 at 10:27
@WiktorStribiżew Let's say I'd just want to match sentences where any word occurs exactly 3 times — Chris Ruehlemann, Feb 28 '20 at 10:29
Ok, so, shall we reopen and post the `df[grepl("\\b(\\w+)\\b.*\\b\\1\\b", df$Turn, perl=TRUE),]` answer or do you want to edit the question? — Wiktor Stribiżew, Feb 28 '20 at 10:34
@WiktorStribiżew I've edited the question; see **EDIT 1** and **EDIT 2** and would be grateful if you, or anybody else, could reopen the post — Chris Ruehlemann, Feb 28 '20 at 15:03
I think you need something like `exact_amount %in% table(strsplit(x, "\\W+"))`, see [this R demo](https://rextester.com/UIUDV66542) where `exact_amount <- 3` — Wiktor Stribiżew, Feb 28 '20 at 15:32
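
A minimal sketch of the pattern suggested in the comments above, run against the sample sentences from the question. The question's original pattern `(\\w+\\b\\s)\\1{1,}` has no leading word boundary, so the group can start in the middle of a word (for example the `is ` at the end of `this `, followed by `is `), and it only looks at adjacent repetitions. Putting `\\b` on both sides of the captured word and allowing `.*` between the two occurrences addresses both points. The vector name `turns` is an assumption made for this illustration.

```r
# Sample sentences copied from the question
turns <- c("well this is what the grumble about do n't they ?",
           "it 's like being in a play-group , in n it ?",
           "oh is that that steak i got the other night ?",
           "well where have the middle sized soda stream bottle gone ?",
           "this is a half day , right ? needs a full day",
           "yourself , everybody 'd be changing your hair in n it ?",
           "cos he finishes at four o'clock on that day anyway .",
           "no no no i 'm dave and you 're alan .",
           "yeah , i mean the the film was quite long though",
           "it had steve martin in it , it 's a comedy",
           "oh it is a dreary old day in n it ?",
           "no it 's not mother theresa , it 's saint theresa .",
           "oh have you seen that face lift job he wants ?",
           "yeah bolshoi 's right so which one is it then ?")

# \\b(\\w+)\\b  captures one whole word
# .*            allows any text between the two occurrences
# \\b\\1\\b     requires the same whole word again
# perl = TRUE selects the PCRE engine, as recommended in the comments
pattern <- "\\b(\\w+)\\b.*\\b\\1\\b"
repeated <- grepl(pattern, turns, perl = TRUE)

which(repeated)
# [1]  2  3  5  8  9 10 11 12
```

These are the row numbers listed in the question's expected result.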

Hsiang Yun Chan · Accepted Answer · 2020-03-02T09:03:55.553

One option for specifying the number of repeated words.

To extract sentences in which the same word occurs 3 times, change the regex to:

(\s?\b\w+\b\s)(.*\1){2}

(\s?\b\w+\b\s) is captured by Group 1
- \s? : a whitespace character occurs zero or one time.
- \b\w+\b : a whole word (word characters between two word boundaries).
- \s : a whitespace character occurs once.

(.*\1) is captured by Group 2
- (.*\1) : any characters, zero or more times, followed by the text that Group 1 matched.
- (.*\1){2} : Group 2 matches twice.

Code

df$Turn[grepl("(\\s?\\b\\w+\\b\\s)(.*\\1){2}", df$Turn, perl = TRUE)]
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
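
A hedged variation on the pattern above, not taken from the answer. As far as I can tell, the answer's pattern depends on a space following each occurrence, and because the backreference is not wrapped in word boundaries it could also match the tail of a longer word (for example `it ` at the end of `bit `). A pattern of this shape also matches words that occur more than three times, so it expresses "at least n" rather than "exactly n". The helper name `repeated_at_least` and the `examples` vector are assumptions made for this sketch.

```r
# Hypothetical helper: TRUE where some whole word occurs at least `n` times
repeated_at_least <- function(x, n) {
  stopifnot(n >= 2)
  # One captured word, then (n - 1) further whole-word occurrences of it
  pattern <- sprintf("\\b(\\w+)\\b(?:.*\\b\\1\\b){%d}", as.integer(n) - 1L)
  grepl(pattern, x, perl = TRUE)
}

examples <- c("no no no i 'm dave and you 're alan .",
              "it had steve martin in it , it 's a comedy",
              "oh is that that steak i got the other night ?",
              "no no no no")

repeated_at_least(examples, 3)
# [1]  TRUE  TRUE FALSE  TRUE
```

The last example contains "no" four times and still matches, which is why counting the words is the safer route when an exact number is required.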

Alternatively, use strsplit(split = "\\s") to split the sentences into words, then use sapply and table to count the occurrences of each word in each list element, and select the sentences that satisfy the requirement.

Code

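# Convert to character in case Turn is a factor (the data.frame default before R 4.0)
turns <- as.character(df$Turn)
# For each sentence, check whether any whitespace-separated token occurs exactly 3 times
has_triple <- sapply(strsplit(turns, "\\s"), function(words) any(table(words) == 3))
turns[has_triple]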
# [1] "no no no i 'm dave and you 're alan ."     
# [2] "it had steve martin in it , it 's a comedy"
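
The counting approach generalises to any number of occurrences. The following is a minimal base R sketch under stated assumptions: the helper name `has_word_count`, its `exact` argument, and the `examples` vector are all made up for illustration, and tokens are whatever is separated by whitespace (so punctuation marks count as tokens too).

```r
# Hypothetical helper: TRUE where some whitespace-separated token occurs
# exactly `n` times (or at least `n` times when exact = FALSE)
has_word_count <- function(x, n, exact = TRUE) {
  tokens <- strsplit(as.character(x), "\\s+")
  vapply(tokens, function(words) {
    counts <- table(words[nzchar(words)])  # drop empty tokens from leading spaces
    if (exact) any(counts == n) else any(counts >= n)
  }, logical(1))
}

examples <- c("no no no i 'm dave and you 're alan .",
              "no no no no",
              "oh is that that steak i got the other night ?")

has_word_count(examples, 3)                 # exactly 3 occurrences
# [1]  TRUE FALSE FALSE
has_word_count(examples, 3, exact = FALSE)  # 3 or more occurrences
# [1]  TRUE  TRUE FALSE
```

Using `any()` on the count table keeps the result a plain logical vector even when a sentence contains several words with the requested count.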

Hope this may help you :)

score 1 · Answer 2 · answered Mar 01 '20 at 03:46

I would take a different approach to this task. First, I added an id variable to the original data frame. Then, I counted how many times each word appears in each sentence and stored the result in a data frame, mytemp.

library(tidyverse)

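# Add a sentence id so the counts can be traced back to the original rows
df <- mutate(df, id = 1:n())

# One row per (sentence, word) pair together with its frequency
mytemp <- df %>% 
  mutate(word = strsplit(x = as.character(Turn), split = " ")) %>% 
  unnest(word) %>% 
  count(id, word, name = "frequency", sort = TRUE)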

Using this data frame, it is straightforward to identify the sentences. I subset the data to get the ids of sentences that contain a word appearing three times, and similarly the ids of sentences that contain a word appearing more than once. Finally, I subset the original data using the ids stored in three and twice.

# Search for words that appear exactly 3 times

three <- filter(mytemp, frequency == 3) %>% 
         pull(id) %>% 
         unique()

# Search for words that appear more than once.

twice <- filter(mytemp, frequency > 1) %>% 
         pull(id) %>% 
         unique()

# Go back to the original data and handle subsetting
filter(df, id %in% three)

  Turn                                          id
  <chr>                                      <int>
1 no no no i 'm dave and you 're alan .          8
2 it had steve martin in it , it 's a comedy    10

filter(df, id %in% twice)

  Turn                                                   id
  <chr>                                               <int>
1 it 's like being in a play-group , in n it ?            2
2 oh is that that steak i got the other night ?           3
3 this is a half day , right ? needs a full day           5
4 no no no i 'm dave and you 're alan .                   8
5 yeah , i mean the the film was quite long though        9
6 it had steve martin in it , it 's a comedy             10
7 oh it is a dreary old day in n it ?                    11
8 no it 's not mother theresa , it 's saint theresa .    12
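
For readers without the tidyverse, the same long-format count table can be sketched in base R. This is an illustrative equivalent rather than the answer's own code; the function name `word_counts` and the three-sentence `sentences` vector are assumptions.

```r
sentences <- c("no no no i 'm dave and you 're alan .",
               "oh is that that steak i got the other night ?",
               "cos he finishes at four o'clock on that day anyway .")

# Hypothetical helper: one row per (sentence id, word) with its frequency
word_counts <- function(x) {
  rows <- lapply(seq_along(x), function(i) {
    counts <- table(strsplit(x[i], " ", fixed = TRUE)[[1]])
    data.frame(id = rep(i, length(counts)),
               word = names(counts),
               frequency = as.integer(counts),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, rows)
}

counts <- word_counts(sentences)

# ids of sentences with a word used exactly 3 times
three <- unique(counts$id[counts$frequency == 3])
# ids of sentences with any word used more than once
twice <- unique(counts$id[counts$frequency > 1])

sentences[three]
sentences[twice]
```

Keeping the counts in a table like this makes it easy to inspect which word triggered the match, not just which sentence.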
