
I have a string and I want to extract matches from it using regex. The string is as follows:

you and he and she and me

And my regex is (so far):

(\w+) and (\w+)

I want it to give this result:

(you, he), (he, she), (she, me)

but the current result includes only 2 matches:

(you, he), (she, me)

How can I achieve this?

Jan
Fhals

3 Answers


What you're asking for is overlapping matches.

This is how you do it:

import re

s = "you and he and she and me"

# The lookahead consumes no characters, so a new match can start inside the previous one
print(re.findall(r'(?=\b(\w+) and (\w+)\b)', s))

In fact, it does such a good job of looking for overlaps that you need the \b's I added to anchor the match at word boundaries. Otherwise you get:

[('you', 'he'), ('ou', 'he'), ('u', 'he'), ('he', 'she'), ('e', 'she'), ('she', 'me'), ('he', 'me'), ('e', 'me')]
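
To see why the lookahead finds overlapping pairs, it can help to inspect the match objects themselves. The following is only a small sketch using the same string and pattern; the variable names `pattern` and `hits` are made up for illustration. Each match is empty (zero width), so the engine only advances one position at a time and the groups captured inside the lookahead are free to overlap.

```python
import re

s = "you and he and she and me"
pattern = re.compile(r'(?=\b(\w+) and (\w+)\b)')

hits = []
for m in pattern.finditer(s):
    # m.group(0) is the empty string: the lookahead itself consumes nothing,
    # only the groups inside it capture text.
    hits.append((m.start(), m.group(0), m.groups()))

for start, whole, groups in hits:
    print(start, repr(whole), groups)
```

The start offsets show that each pair begins at the first letter of a word, which is what the leading \b enforces.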
rrauenza

You can use a zero-width positive lookahead, like:

(?=(?:^|\s)(\w+)\s+and\s+(\w+))
  • The zero-width lookahead starts with (?= and ends with the final )

  • (?:^|\s) is a non-capturing group, ensuring the pattern begins at the start of the string or right after a whitespace character

  • (\w+)\s+and\s+(\w+) captures the two desired words in the first and second groups

Example:

In [11]: s = 'you and he and she and me'

In [12]: re.findall(r'(?=(?:^|\s)(\w+)\s+and\s+(\w+))', s)
Out[12]: [('you', 'he'), ('he', 'she'), ('she', 'me')]
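
The (?:^|\s) prefix and a \b word boundary give the same result on the question's string, but they are not interchangeable in general. As a rough illustration, on a made-up input of my own (`"me-you and he"`, not from the question), the whitespace-anchored pattern rejects a word that follows a hyphen, while \b accepts it:

```python
import re

# Hypothetical input, chosen only to show where the two anchors differ
s = "me-you and he"

# Word must sit at the start of the string or right after whitespace
ws_anchored = re.findall(r'(?=(?:^|\s)(\w+)\s+and\s+(\w+))', s)

# Word must merely start at a word boundary (the hyphen counts as one)
wb_anchored = re.findall(r'(?=\b(\w+) and (\w+)\b)', s)

print(ws_anchored)
print(wb_anchored)
```

Which behavior is preferable depends on whether hyphenated or punctuated text should count as separate words in your data.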
heemayl

As others pointed out, what you're looking for is called overlapping matches.
With the newer regex module (a third-party package, not the standard re), you can keep your initial approach and pass the overlapped flag:

import regex as re

string = "you and he and she and me"
rx = r'\b(\w+) and (\w+)\b'

matches = re.findall(rx, string, overlapped=True)
print(matches)
# [('you', 'he'), ('he', 'she'), ('she', 'me')]

Hint: you still need the word boundaries (\b); otherwise you'll get unexpected results.
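
If installing regex is not an option, roughly the same behavior can be approximated with the standard re module by restarting the search one character past the start of each match. This is only a sketch of the idea, not how regex implements overlapped=True, and the helper name `find_overlapping` is made up for illustration:

```python
import re

def find_overlapping(pattern, text):
    """Collect the groups of every match, allowing matches to overlap."""
    compiled = re.compile(pattern)
    results = []
    pos = 0
    while True:
        m = compiled.search(text, pos)
        if m is None:
            break
        results.append(m.groups())
        # Resume just after the start of this match, not after its end,
        # so the next match may reuse characters from this one.
        pos = m.start() + 1
    return results

string = "you and he and she and me"
matches = find_overlapping(r'\b(\w+) and (\w+)\b', string)
print(matches)
```

Because search(text, pos) still sees the characters before pos, the \b check keeps working and partial words such as "ou" are not matched.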

Jan