
I am trying to process transcripts in C# that appear to have been produced by voice-to-text software. One major issue I am running into is repeated words and/or phrases. I would love to use a regular expression to replace them all. Here is an example:

I, I, I am really wanting to go, but I I am not, am not able to do it.

I would really like to use a regex replace so that it turns out something like this:

I am really wanting to go, but I am not able to do it.

Words appear to repeat multiple times, either with or without a comma. If I try a replace that looks for specific repeats, it replaces 2 of the 3 but leaves the last two. So it's becoming a royal pain to come up with a way to look for multiple repeats and replace them with a single copy of the word: if I have "I, I, I", it should be replaced with "I", and "I I" should also be replaced with just one "I".

Also, if there are phrases like:

you know, you know you know

I would like to be able to replace the three with just one.

I've tried ones like this: `\b(\w+)\s+\1\b`, but it doesn't work with commas.

I have looked and can't really find anything that handles comma-separated repeats. I'm fine if it takes multiple calls; I'm just trying to figure it out.

Any help would be appreciated!

markalex
  • Try [`(\b[\w]+(?:\s+[\w]+)*)(,?\s*\1)+`](https://regex101.com/r/UHiEoO/1) – markalex Apr 24 '23 at 19:43
  • What about e.g. `you know, you, know you know` here no replacement? – bobble bubble Apr 24 '23 at 21:22
  • @bobblebubble, do you think it's even possible in general case with regex? – markalex Apr 24 '23 at 21:33
  • @markalex I got this far: [`\b(?:([\w']++)(?=.*?\b(\2?+,? \1\b))[, ]*)+(?=\2)[, ]*`](https://regex101.com/r/lFtP4E/3) in PCRE and [here a .NET version](http://regexstorm.net/tester?p=%5cb%28%3f%3a%28%28%3f%3e%5b%5cw%27%5d%2b%29%29%28%3f%3d.*%3f%5cb%28%28%3f%3e%5c2%3f%29%2c%3f+%5c1%5cb%29%29%5b%2c+%5d*%29%2b%28%3f%3d%5c2%29%5b%2c+%5d*&i=you+know%2c+you%2c+know+you+know%0d%0a1+1+2+3%2c+3+3&r=). I guess it uses what's called *forward references* (correct me if I'm wrong). – bobble bubble Apr 24 '23 at 23:21
  • @markalex, would you know how to do this with both commas and periods? like "you, you" and "you. you" I just ran into that situation, but if I try to add a period, it will replace regular words. – Andrew Harbert Apr 24 '23 at 23:24
  • Removed my answer. There were too many problems and the regex [grew huge](https://regex101.com/r/wfMGnu/1) when I tried to fix them. Yet it worked best in PCRE (even not flawlessly). Interesting challenge however! :) – bobble bubble Apr 26 '23 at 14:07

3 Answers


You can use `(\b\w+(?:\s+\w+)*?)(,?\s*\1)+\b` with the replacement string `$1`.

Here:

  • `(\b\w+(?:\s+\w+)*?)` matches one or more words separated by whitespace:
    • `\b\w+` matches word characters from the beginning of a word,
    • `(?:\s+\w+)*?` matches one or more whitespace characters followed by word characters, repeated as few times as possible (including zero).
  • `(,?\s*\1)+` matches the same words as the first group (hence `\1`), preceded by an optional comma and any amount of whitespace, repeated one or more times.
  • `\b` ensures that the last repetition doesn't stop in the middle of a word.

Demo here.
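
Since the question is about C#, here is a minimal sketch of how the pattern and the `$1` replacement might be applied with `Regex.Replace`. The class and method names are hypothetical, and `CollapseAll` is an added assumption rather than part of the answer: it repeats the pass until the text stops changing, because collapsing one repeat can expose another.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RepeatCollapser
{
    // Pattern from the answer: group 1 is the shortest run of words that is
    // immediately repeated, with an optional comma between repetitions.
    private static readonly Regex Repeats =
        new Regex(@"(\b\w+(?:\s+\w+)*?)(,?\s*\1)+\b");

    // One pass: each run of repetitions is replaced by a single copy (group 1).
    public static string Collapse(string input) => Repeats.Replace(input, "$1");

    // Hypothetical helper (not from the answer): repeat the pass until the
    // text stops changing. Each replacement shortens the text, so this ends.
    public static string CollapseAll(string input)
    {
        string previous;
        do
        {
            previous = input;
            input = Collapse(input);
        } while (input != previous);
        return input;
    }

    public static void Main()
    {
        Console.WriteLine(Collapse("I, I, I am really wanting to go, but I I am not, am not able to do it."));
        // I am really wanting to go, but I am not able to do it.
        Console.WriteLine(Collapse("you know, you know you know"));
        // you know
        Console.WriteLine(Collapse("I, I am, I am"));
        // I am, I am   (a single pass only collapses "I, I")
        Console.WriteLine(CollapseAll("I, I am, I am"));
        // I am
    }
}
```

One thing to keep in mind: the separator `,?\s*` may match nothing at all, so a word that is itself a doubled sequence (for example "mama") would also be shortened by this pattern.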

Word of caution: this regex will remove every repetition, as asked in the question. But sometimes repeated words are valid, as in "We'll move, move far away".


Edit: to accommodate dots between repetitions, you can use

(\b\w+(?:\s+\w+)*?)([,.]?\s*\1)+\b

It will match the following separators between repeated words: `,`, `.`, `, `, `. `, a single space, etc.

If you want to match any combination of punctuation and spaces you can use

(\b\w+(?:\s+\w+)*?)([,.\s]*\1)+\b

or even

(\b\w+(?:\s+\w+)*?)([\p{P}\s]*\1)+\b

The first one matches any combination of dots, commas, and whitespace, for example `,. , .`. The second matches any combination of whitespace and punctuation marks, for example `*;!? .`.

Demo here.
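
As a rough C# illustration of the punctuation variants, the sketch below applies the `[,.]?` and `[\p{P}\s]*` patterns with the same `$1` replacement. The class and method names and the sample inputs are made up for the example.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PunctuationRepeats
{
    // Variants from the answer; only the separator part differs.
    private static readonly Regex CommaOrDot =
        new Regex(@"(\b\w+(?:\s+\w+)*?)([,.]?\s*\1)+\b");
    private static readonly Regex AnyPunctuation =
        new Regex(@"(\b\w+(?:\s+\w+)*?)([\p{P}\s]*\1)+\b");

    public static string CollapseCommaOrDot(string s) => CommaOrDot.Replace(s, "$1");
    public static string CollapseAnyPunctuation(string s) => AnyPunctuation.Replace(s, "$1");

    public static void Main()
    {
        Console.WriteLine(CollapseCommaOrDot("you. you know"));             // you know
        Console.WriteLine(CollapseCommaOrDot("I, I. I am"));                // I am
        Console.WriteLine(CollapseAnyPunctuation("well;! well, it works")); // well, it works
    }
}
```

The broader the separator class, the more likely it is that a legitimate repetition across a sentence boundary gets collapsed too, so the narrowest variant that handles the transcript is probably the safer choice.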

markalex
  • hmm, this works a bit better as well. Any idea how to include items that have a period as well? This freaking app has terrible punctuation., it is also throwing a period in where there should be a comma. – Andrew Harbert Apr 24 '23 at 23:32
  • @AndrewHarbert, added regex for dots, and more general for all punctuation marks. – markalex Apr 25 '23 at 05:53
  • It is not recommended to put shorthand character classes into character classes when they are the only construct inside the character class. `[\w]` must be re-written as `\w`. – Wiktor Stribiżew Apr 25 '23 at 07:04
  • @WiktorStribiżew, thank you, corrected. Truth be told, I don't know why I've put them there in the first place, must be long evening. – markalex Apr 25 '23 at 07:12

You can convert matches of the following regular expression to empty strings.

((?:\w+\s+)*\w+),?\s*(?=\1\b)

The idea is to delete a phrase if the same phrase immediately follows.

Demo

This regular expression has the following elements.

(            begin capture group 1
  (?:        begin a non-capture group
    \w+\s+   match >= 1 word chars followed by >= 1 whitespaces
  )*         end non-capture group and execute >= 0 times
  \w+        match >= 1 word chars
)            end capture group 1
,?           optionally match a comma
\s*          match >= 0 whitespaces
(?=          begin positive lookahead
  \1\b       match content of capture group 1 followed by a word boundary
)            end positive lookahead
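
A minimal C# sketch of this approach follows; the class and method names are hypothetical. It also includes an assumed variant with a leading `\b`, which is not part of the pattern above: as posted, a match may start in the middle of a word, so a word ending in a doubled letter can lose that letter.

```csharp
using System;
using System.Text.RegularExpressions;

public static class LookaheadDedup
{
    // Pattern from the answer: match a phrase (plus optional comma and
    // whitespace) only when the same phrase follows immediately.
    private static readonly Regex AsPosted =
        new Regex(@"((?:\w+\s+)*\w+),?\s*(?=\1\b)");

    // Assumed variant, not from the answer: a leading \b keeps a match
    // from starting in the middle of a word.
    private static readonly Regex Anchored =
        new Regex(@"\b((?:\w+\s+)*\w+),?\s*(?=\1\b)");

    // Matches are deleted, leaving only the last copy of each repeat.
    public static string RemoveAsPosted(string s) => AsPosted.Replace(s, "");
    public static string RemoveAnchored(string s) => Anchored.Replace(s, "");

    public static void Main()
    {
        Console.WriteLine(RemoveAsPosted("you know, you know you know")); // you know
        Console.WriteLine(RemoveAsPosted("I will go"));                   // I wil go
        Console.WriteLine(RemoveAnchored("I will go"));                   // I will go
        Console.WriteLine(RemoveAnchored("I, I, I am really wanting to go, but I I am not, am not able to do it."));
        // I am really wanting to go, but I am not able to do it.
    }
}
```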
Cary Swoveland

Thanks to markalex, who posted this solution in the comments:

(\b[\w]+(?:\s+[\w]+)*)(,?\s*\1)+
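
The answer does not say how the pattern is meant to be used. A plausible reading, assumed here, is `Regex.Replace` with the replacement `$1`, as in the first answer. The sketch below (with hypothetical names) also compares it against the same pattern with a trailing `\b` added, since without that boundary the last repetition may end in the middle of a longer word.

```csharp
using System;
using System.Text.RegularExpressions;

public static class BoundaryComparison
{
    // Pattern as posted in this answer (no trailing \b).
    private static readonly Regex WithoutBoundary =
        new Regex(@"(\b[\w]+(?:\s+[\w]+)*)(,?\s*\1)+");

    // Same pattern with a trailing \b added for comparison.
    private static readonly Regex WithBoundary =
        new Regex(@"(\b[\w]+(?:\s+[\w]+)*)(,?\s*\1)+\b");

    // The "$1" replacement is an assumption; the answer does not state one.
    public static string CollapseWithoutBoundary(string s) => WithoutBoundary.Replace(s, "$1");
    public static string CollapseWithBoundary(string s) => WithBoundary.Replace(s, "$1");

    public static void Main()
    {
        Console.WriteLine(CollapseWithoutBoundary("you know, you know you know")); // you know
        Console.WriteLine(CollapseWithoutBoundary("I am not, am nothing"));        // I am nothing
        Console.WriteLine(CollapseWithBoundary("I am not, am nothing"));           // I am not, am nothing
    }
}
```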
S.B
  • @markalex has posted his solution so feel free to accept it. – Guru Stron Apr 24 '23 at 19:55
  • Side note: attribution requires link to content and clearly marking quote with quote syntax. Obviously you would not add "thank you" notes to answer (in the same way you would not add those to questions), but rather provide additional information about the quote in your own words. If you really have to convert comment to an answer without any content of your own - mark the answer wiki (https://meta.stackoverflow.com/questions/251597/question-with-no-answers-but-issue-solved-in-the-comments-or-extended-in-chat) – Alexei Levenkov Apr 24 '23 at 20:01
  • Can you edit to explain how this regex is to be used to obtain the desired return values? It would also be helpful to link to it being applied to the examples in the question at regex101.com (as the other answers have done). – Cary Swoveland Apr 24 '23 at 21:02