0

I tried this code:

re.findall(r"d.*?c", "dcc")

to search for substrings with first letter d and last letter c.

But I get the output ['dc'].

I expected the output to be ['dc', 'dcc'].

What did I do wrong?

Aran-Fey
Onion123

4 Answers

0

What you're looking for isn't possible using any built-in regexp functions that I know of. re.findall() only returns non-overlapping matches. After it matches dc, it looks for another match starting after that. Since the rest of the string is just c, and that doesn't match, it's done, so it just returns ["dc"].
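The non-overlapping behaviour can be seen by printing where each match starts and ends. This is only an illustrative sketch; the variable names are made up for the example.

```python
import re

text = "dcc"

# finditer yields the same matches as findall, but with positions attached.
spans = [(m.start(), m.end(), m.group()) for m in re.finditer(r"d.*?c", text)]
print(spans)  # [(0, 2, 'dc')]

# The scan resumes at index 2, where only "c" is left, so nothing else matches.
remainder = text[spans[-1][1]:]
print(remainder)  # c
```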

When you use a quantifier like *, you have a choice of making it greedy, or non-greedy -- either it finds the longest or shortest match of the regexp. To do what you want, you need a way of telling it to look for successively longer matches until it can't find anything. There's no simple way to do this. You can use a quantifier with a specific count, but you'd have to loop it in your code:

d.{0}c
d.{1}c
d.{2}c
d.{3}c
...

If you have a regexp with multiple quantified sub-patterns, you'd have to try all combinations of lengths.
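For the single-quantifier case, the loop over fixed counts might look roughly like the sketch below. The function name and the `max_gap` parameter are hypothetical, and the lookahead wrapper is an addition not mentioned above: it lets matches of the same length that overlap each other be reported as well.

```python
import re

def all_matches_fixed_width(text, max_gap=None):
    """Collect every substring matching d.{n}c for n = 0, 1, 2, ..."""
    if max_gap is None:
        max_gap = len(text)  # a gap can never be longer than the text itself
    found = []
    for n in range(max_gap + 1):
        # The lookahead consumes no characters, so every start position is tried.
        pattern = re.compile(r"(?=(d.{%d}c))" % n)
        found.extend(m.group(1) for m in pattern.finditer(text))
    return found

print(all_matches_fixed_width("dcc"))  # ['dc', 'dcc']
```

Results come out grouped by length, shortest first. Like the patterns above, this relies on `.`, which does not match a newline by default.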

Barmar
0

Your two problems are that .* is greedy while .*? is minimal, and that re.findall() only returns non-overlapping matches. Here's a possible solution:

import re

def findall_inner(expr, text):
    # Start from the greedy (longest) matches, then shrink them step by step.
    explore = list(re.findall(expr, text))
    matches = set()
    while explore:
        word = explore.pop()
        if len(word) >= 2 and word not in matches:
            explore.extend(re.findall(expr, word[1:]))  # retry without the first letter
            explore.extend(re.findall(expr, word[:-1]))  # retry without the last letter
        matches.add(word)
    return sorted(matches)  # sorted, because a set has no fixed order

found = findall_inner(r"d.*c", "dcc")
print(found)  # ['dc', 'dcc']

This is a little bit of overkill, using findall instead of search and using >= 2 instead of > 2, as in this case there can only be one non-overlapping match of d.*c and one-character strings cannot match the pattern. But there is some flexibility in it depending on what other kinds of patterns you might want.

user149485
-1

Try this regex:

^d.*c$

Essentially, this requires the start of the string to be d and the end of the string to be c. Note that it only matches the whole string, so for "dcc" it returns ['dcc'] rather than every substring.

sdavis891
-2

This is a very important point to understand: a regex engine always returns the leftmost match, even if a "better" match could be found later. When applying a regex to a string, the engine starts at the first character of the string and tries every possible way of matching the regular expression from that position. Only if all possibilities have been tried and found to fail does the engine continue with the second character in the text. So once it finds 'dc', the engine moves past 'dc' and continues from the second 'c'. That is why re.findall() with this pattern can never return 'dcc'.
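Since the engine reports at most one match per starting position, one way around it is to skip scanning altogether and test every substring directly. This is a different technique from anything described above, shown here only as a rough sketch; the function name is hypothetical, and the nested loops make it quadratic in the length of the text.

```python
import re

def all_substring_matches(pattern, text):
    """Return every substring of text that the pattern matches in full."""
    compiled = re.compile(pattern)
    results = []
    for start in range(len(text)):
        for end in range(start + 1, len(text) + 1):
            # fullmatch with pos/endpos checks exactly text[start:end].
            if compiled.fullmatch(text, start, end):
                results.append(text[start:end])
    return results

print(all_substring_matches(r"d.*?c", "dcc"))  # ['dc', 'dcc']
```

Because each substring must match in full, it makes no difference here whether the quantifier is greedy or lazy.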

Saeed Bolhasani