Why does this regular expression to match two consecutive words not work?

Question

There is a similar question here: Regular Expression For Consecutive Duplicate Words. This addresses the general question of how to solve this problem, whereas I am looking for specific advice on why my solution does not work.

I'm using Python regex, and I'm trying to match all consecutively repeated words, such as "to to" and "this this" in:

I am struggling to to make this this work

I tried:

[A-Za-z0-9]* {2}

This is the logic behind this choice of regex: The '[A-Za-z0-9]*' should match any word of any length, and '[A-Za-z0-9]* ' makes it consider the space at the end of the word. Hence [A-Za-z0-9]* {2} should flag a repetition of the previous word with a space at the end. In other words it says "For any word, find cases where it is immediately repeated after a space".

How is my logic flawed here? Why does this regex not work?

No part of your regex mentions a repetition of the *same* word - it just looks for any two words with a space between them — UnholySheep, Mar 04 '18 at 21:42
No repetition does *not* mean the *same* word. Like for instance `[a-z]*` does not mean repeating the *same* character. — Willem Van Onsem, Mar 04 '18 at 21:42

poke · Accepted Answer · 2018-03-04T23:16:48.297

[A-Za-z0-9]* {2}

Quantifiers in regular expressions always apply only to the element directly in front of them. So \d+ looks for one or more digits, but x\d+ looks for a single x followed by one or more digits. In your expression, the {2} therefore applies only to the space: it looks for a word followed by two spaces.
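To see that binding in action, here is a minimal sketch that runs the original pattern against the question's sentence and against a made-up string containing two spaces in a row. The variable names and the second test string are only illustrative assumptions.

```python
import re

# The original pattern: {2} binds only to the space directly before it,
# so it means "zero or more word characters, then exactly two spaces".
pattern = re.compile(r'[A-Za-z0-9]* {2}')

single_spaced = "I am struggling to to make this this work"
double_spaced = "to  make"  # hypothetical input with two spaces after "to"

no_match = pattern.findall(single_spaced)
two_space_match = pattern.findall(double_spaced)

print(no_match)         # [] because the sentence never has two spaces in a row
print(two_space_match)  # ['to  '] a word followed by two spaces
```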

If you want a quantifier to apply to more than just a single thing, you need to group it first, e.g. (x\d)+. This is a capturing group, so it will actually capture that in the result. This is sometimes undesired if you just want to group things to apply a common quantifier. In that case, you can prefix the group with ?: to make it a non-capturing group: (?:x\d)+.
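A small sketch of the difference, using the x\d example from above on an invented input string. Note how re.findall returns the contents of a capturing group (only its last repetition) rather than the whole match, which is one reason a non-capturing group can be preferable.

```python
import re

text = "x1x2x3"  # illustrative input, not from the question

# No group: + applies only to \d, so each match is one x plus digits.
ungrouped = re.findall(r'x\d+', text)

# Capturing group: the whole run matches once, but findall reports
# only what the group captured on its last repetition.
capturing = re.findall(r'(x\d)+', text)

# Non-capturing group: same match, and findall reports all of it.
non_capturing = re.findall(r'(?:x\d)+', text)

print(ungrouped)      # ['x1', 'x2', 'x3']
print(capturing)      # ['x3']
print(non_capturing)  # ['x1x2x3']
```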

So, going back to your regular expression, you would have to do it like this:

([A-Za-z0-9]* ){2}

However, this does not actually have any check that the second matched word is the same as the first one. If you want to match for that, you will need to use backreferences. Backreferences allow you to reference a previously captured group within the expression, looking for it again. In your case, this would look like this:

([A-Za-z0-9]*) \1

The \1 will reference the first capturing group, which is ([A-Za-z0-9]*). So the group will match the first word. Then, there is a space, followed by a backreference to the first word again. So this will look for a repetition of the same word separated by a space.
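The two expressions can be compared side by side with a short sketch on the sentence from the question. The variable names are only illustrative. Because the group uses *, it can also capture an empty string, so the backreference version additionally matches every lone space; the sketch filters those out.

```python
import re

sentence = "I am struggling to to make this this work"

# Grouped quantifier: any two space-terminated words, equal or not.
any_two = [m.group(0) for m in re.finditer(r'([A-Za-z0-9]* ){2}', sentence)]

# Backreference: \1 must repeat exactly what group 1 captured.
same_word = [m.group(0) for m in re.finditer(r'([A-Za-z0-9]*) \1', sentence)]

# With *, group 1 may be empty, so a lone space also counts as
# "empty word, space, empty word". Keep only the real repetitions.
repeated = [m for m in same_word if m.strip()]

print(any_two)   # ['I am ', 'struggling to ', 'to make ', 'this this ']
print(repeated)  # ['to to', 'this this']
```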

As bobble bubble points out in the comments, there is still a lot one can do to improve the regular expression. While my main concern was to explain the various concepts without focusing too much on your particular example, I guess I still owe you a more robust regular expression for matching two consecutive words within a string that are separated by a space. This would be my take on that:

\b(\w+)\s\1\b

There are a few things that are different from the previous approach: First of all, I’m looking for word boundaries around the whole expression. The \b matches at a position where a word starts or ends. This prevents the expression from matching within other words, e.g. neither foo fooo nor foo oo would be matched.

Then, the regular expression requires at least one character, so empty words won’t be matched. I’m also using \w here, which is a more flexible way of including alphanumeric characters (it also matches the underscore). And finally, instead of looking for a literal space, I accept any kind of whitespace between the words, so this could even match tabs or line breaks. It might make sense to add a quantifier there too, i.e. \s+, to allow multiple whitespace characters.
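As a rough check of these points, the sketch below uses the \s+ variant and tries it on the near-misses mentioned above plus a sentence where one repetition is separated by a tab. The helper name repeated_words and the test strings are assumptions made for illustration.

```python
import re

# The more robust pattern, with \s+ to allow any run of whitespace.
robust = re.compile(r'\b(\w+)\s+\1\b')

def repeated_words(text):
    """Return each word that is immediately repeated in text."""
    return robust.findall(text)

# Partial overlaps inside longer words should not match.
false_hits = [s for s in ("bar arg", "foo fooo", "foo oo") if robust.search(s)]

# A space and a tab both count as separators.
found = repeated_words("I am struggling to to make this\tthis work")

print(false_hits)  # []
print(found)       # ['to', 'this']
```

The comparison is still case-sensitive, so "The the" would not be reported unless the re.IGNORECASE flag is passed to re.compile.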

Of course, whether this works better for you depends a lot on your actual requirements, which we can’t tell from your one example. But this should at least give you a few ideas on how to continue.

Your regex `([A-Za-z0-9]*) \1` is not using any word boundaries thus matching such as [`bar arg`](https://regex101.com/r/aDtIHp/2) (Same @adapap's answer). Yours further even matches any space because it captures the zero-length match before it and checks if there is the "same" zero-length match after it. — bobble bubble, Mar 04 '18 at 22:45
@bobblebubble Those are good points. I’ve expanded my answer above to cover that more properly. Thank you for the feedback! — poke, Mar 04 '18 at 23:17

score 3 · Answer 2 · answered Mar 04 '18 at 21:44

You can match a previous capture group with \1 for the first group, \2 for the second, etc...

import re
s = "I am struggling to to make this this work"
matches = re.findall(r'([A-Za-z0-9]+) \1', s)
print(matches)

>>> ['to', 'this']

If you want both occurrences, add a capture group around \1:

matches = re.findall(r'([A-Za-z0-9]+) (\1)', s)
print(matches)

>>> [('to', 'to'), ('this', 'this')]

Martin · Answer 3 · 2018-03-05T07:00:57.170

At a glance, it looks like this will match any two words, not repeated words. The asterisk (*) matches zero or more times, so you should probably use plus (+) for one or more. Then you need to capture the word and reuse the result of the capture. Additionally, \w can be used for alphanumeric characters for clarity, and \b can be used to match the empty string at a word boundary.

Something along the lines of the example below will get you part of the way.

>>> import re
>>> p = re.compile(r'\b(\w+) \1\b')
>>> p.findall('fa fs bau saa saa fa bau eek mu muu bau')
['saa']

These pages may offer some guidance:

score 1 · Answer 4 · answered Mar 04 '18 at 21:48

This should work: \b([A-Za-z0-9]+)\s+\1\b

\b matches a word boundary, \s matches whitespace and \1 specifies the first capture group.

>>> s = 'I am struggling to to make this this work'
>>> re.findall(r'\b([A-Za-z0-9]+)\s+\1\b', s)
['to', 'this']

Sean Breckenridge


johnashu · Answer 5 · 2018-03-04T22:36:13.303

Here is a simple solution not using RegEx.

sentence = 'I am struggling to to make this this work'

def find_duplicates_in_string(words):
    """ Takes in a string and returns each word that is immediately
        followed by itself, i.e. "this this" yields "this"
    """
    duplicates = []
    words = words.split()

    for i in range(len(words) - 1):
        prev_word = words[i]
        word = words[i + 1]
        if word == prev_word:
            duplicates.append(word)
    return duplicates

print(find_duplicates_in_string(sentence))
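For comparison, the same idea can be written more compactly by pairing each word with its successor. This is only a sketch of an equivalent approach, and the function name consecutive_duplicates is made up for illustration.

```python
def consecutive_duplicates(text):
    """Return each word that equals the word right before it."""
    words = text.split()
    # zip pairs every word with its successor: (w0, w1), (w1, w2), ...
    return [b for a, b in zip(words, words[1:]) if a == b]

result = consecutive_duplicates("I am struggling to to make this this work")
print(result)  # ['to', 'this']
```

Because split() with no arguments splits on any run of whitespace, this also catches repeats separated by tabs or line breaks. Unlike the regex with \b and \w, it compares whole whitespace-delimited tokens, so attached punctuation such as "this this." would prevent a match.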
