I have a data frame of sentences; in some of them, a word is used more than once:
df <- data.frame(Turn = c("well this is what the grumble about do n't they ?",
"it 's like being in a play-group , in n it ?",
"oh is that that steak i got the other night ?",
"well where have the middle sized soda stream bottle gone ?",
"this is a half day , right ? needs a full day",
"yourself , everybody 'd be changing your hair in n it ?",
"cos he finishes at four o'clock on that day anyway .",
"no no no i 'm dave and you 're alan .",
"yeah , i mean the the film was quite long though",
"it had steve martin in it , it 's a comedy",
"oh it is a dreary old day in n it ?",
"no it 's not mother theresa , it 's saint theresa .",
"oh have you seen that face lift job he wants ?",
"yeah bolshoi 's right so which one is it then ?"))
I want to match those sentences in which a word, any word, occurs more than once.
EDIT 1:
The repeated words **can** be adjacent, but they need not be. That is why "Regular Expression For Consecutive Duplicate Words" does not answer my question.
I've been modestly successful with this code:
df[grepl("(\\w+\\b\\s)\\1{1,}", df$Turn),]
[1] well this is what the grumble about do n't they ?
[2] it 's like being in a play-group , in n it ?
[3] oh is that that steak i got the other night ?
[4] this is a half day , right ? needs a full day
[5] yourself , everybody 'd be changing your hair in n it ?
[6] no no no i 'm dave and you 're alan .
[7] yeah , i mean the the film was quite long though
[8] it had steve martin in it , it 's a comedy
[9] oh it is a dreary old day in n it ?
The success is just modest because some sentences are matched that should not be matched, e.g., "yourself , everybody 'd be changing your hair in n it ?", while others are not matched that should be, e.g., "no it 's not mother theresa , it 's saint theresa .". How can the code be improved to produce exact matches?
Expected result:
df
Turn
2 it 's like being in a play-group , in n it ?
3 oh is that that steak i got the other night ?
5 this is a half day , right ? needs a full day
8 no no no i 'm dave and you 're alan .
9 yeah , i mean the the film was quite long though
10 it had steve martin in it , it 's a comedy
11 oh it is a dreary old day in n it ?
12 no it 's not mother theresa , it 's saint theresa .
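One direction that might work, sketched here under stated assumptions rather than offered as a definitive solution: capture a whole word between word boundaries and require the same whole word again anywhere later in the string. The vector `turns` below is a hypothetical subset of the data, and the pattern assumes that a "word" is any run of `\w` characters, so clitics such as 's or 'd count as the words "s" and "d".

```r
# Hypothetical subset of df$Turn, for illustration only
turns <- c("yourself , everybody 'd be changing your hair in n it ?",
           "no it 's not mother theresa , it 's saint theresa .",
           "well this is what the grumble about do n't they ?",
           "yeah , i mean the the film was quite long though")

# \\b(\\w+)\\b  captures one whole word
# .*            allows any material in between (so repeats need not be adjacent)
# \\b\\1\\b     requires the same whole word again
pattern <- "\\b(\\w+)\\b.*\\b\\1\\b"

repeated <- grepl(pattern, turns, perl = TRUE)
repeated
# [1] FALSE  TRUE FALSE  TRUE
```

The word boundaries on both sides of the capture group and of the backreference are what keep "your" from matching inside "yourself" and "is" from matching inside "this". Applied to the full data, the same call would be `df[grepl(pattern, df$Turn, perl = TRUE), , drop = FALSE]`.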
EDIT 2:
A further question is how to specify the exact number of repetitions. The above, imperfect, regex matches words that are repeated at least once. If I change the quantifier to {2}, thus looking for a triple occurrence of a word, I get this code and this result:
df[grepl("(\\w+\\b\\s)\\1{2}", df$Turn),]
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
But again the match is imperfect as the expected result would be:
[1] no no no i 'm dave and you 're alan . # "no" occurs 3 times
[2] it had steve martin in it , it 's a comedy # "it" occurs 3 times
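For an exact count, a non-regex route may be easier to reason about: extract the words of each sentence, tabulate them, and test whether any word has the desired frequency. The following is a minimal sketch; `has_word_n_times` is a hypothetical helper name, `turns` is a hypothetical subset of the data, and a "word" is again assumed to be any run of `\w` characters.

```r
# Hypothetical subset of df$Turn, for illustration only
turns <- c("no no no i 'm dave and you 're alan .",
           "it had steve martin in it , it 's a comedy",
           "oh is that that steak i got the other night ?",
           "cos he finishes at four o'clock on that day anyway .")

# Hypothetical helper: TRUE where some word occurs exactly n times
has_word_n_times <- function(x, n) {
  # One character vector of words per sentence
  words <- regmatches(x, gregexpr("\\w+", x))
  # Tabulate each sentence's words and compare the counts with n
  vapply(words, function(w) any(table(w) == n), logical(1))
}

has_word_n_times(turns, 3)
# [1]  TRUE  TRUE FALSE FALSE
```

Replacing `== n` with `>= n` would give "at least n occurrences" instead of "exactly n".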
Any help is much appreciated!