Regular expression: matching and grouping a variable number of space separated words

Question

I have a string:

"foo hello world baz 33"

The part between foo and baz will be some number of space separated words (one or more). I want to match this string with an re that will group out each of those words:

>>> re.match(r'foo (<some re here>) baz (\d+)', "foo hello world baz 33").groups() 
('hello', 'world', '33')

The re should be flexible so that it will work in case there are no words around it:

>>> re.match(r'(<some re here>)', "hello world").groups() 
('hello', 'world')

I'm trying variations with ([\w+\s])+, but I'm not able to capture a dynamically determined number of groups. Is this possible?

You will need a `re.findall` and 3 capturing groups: `re.findall(r'^foo (\S+) (\S+) baz (\d+)', 'foo hello world baz 33')`. See [demo](https://ideone.com/rAWt3I). — Wiktor Stribiżew, Oct 29 '15 at 14:01
This won't work. There may be any number of words here. So "foo hello hello hello baz 33" will not match — Neil, Oct 29 '15 at 14:03
Not a problem, I updated the [code](https://ideone.com/rAWt3I). The regex can be `r'^foo (\S+(?:\s+\S+)*) (\S+) baz (\d+)'`. Or do you want to have the words in the first capturing group to be split? Then, it is impossible without additional operations. Just regex won't do. — Wiktor Stribiżew, Oct 29 '15 at 14:04
Since the strings are space-separated words, use a `.split` function. I'd suggest not relying on regex for as simple a task as this. — hjpotter92, Oct 29 '15 at 14:06
@stribizhev I think the OP needs all words b/w `foo` and `baz` to be split in an array. — hjpotter92, Oct 29 '15 at 14:06
@Sword '33' there just to show that I need to match and capture other parts of the string. But question was about how to match and capture the part between foo and baz — Neil, Oct 29 '15 at 14:11
@stribizhev your updated version doesn't work for 'foo hello world blah blah baz 33' — Neil, Oct 29 '15 at 14:11
@Neil: It depends on how you need it to work. It [works like this](https://ideone.com/sAGK0P). You will have to split the first element of the resulting array as an additional step. It is not possible to do in Python with a single regex. — Wiktor Stribiżew, Oct 29 '15 at 14:14

Hypothetical Ninja · Accepted Answer · 2015-10-30T04:52:33.377

8

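`re.match` only matches at the start of the string. Use `re.search` instead to find a match anywhere in it.
`.*?` matches the shortest possible text between two words/expressions (`.` matches any character except a newline, `*` means 0 or more occurrences, and the trailing `?` makes the repetition lazy, i.e. as short as possible).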

import re

my_str = "foo hello world baz 33"
my_pattern = r'foo\s(.*?)\sbaz'
p = re.search(my_pattern, my_str, re.I)  # re.I: case-insensitive matching
result = p.group(1).split()  # split the captured text on whitespace
print(result)

['hello', 'world']
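
If the trailing number also has to be captured, as in the question's expected `('hello', 'world', '33')`, one possible extension is to add a second group and build the tuple after splitting. This is only a minimal sketch: the function name `parse_line` is hypothetical and not part of the original answer.

```python
import re

def parse_line(text):
    """Return the words between 'foo' and 'baz' plus the trailing number, or None."""
    m = re.search(r'foo\s(.*?)\sbaz\s(\d+)', text)
    if m is None:
        return None
    # group(1) holds all the words as one string; split() breaks it on whitespace
    return tuple(m.group(1).split()) + (m.group(2),)

print(parse_line("foo hello world baz 33"))  # ('hello', 'world', '33')
```

The regex still captures the words as a single group; the variable number of results comes from the `split()` step, not from the pattern itself.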

EDIT:

If foo or baz is missing and you need to return the entire string instead, use an if-else:

if p is not None:
    result = p.group(1).split()
else:
    result = my_str
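
Note that the `else` branch above yields a string while the `if` branch yields a list. If a list of words is wanted in both cases, as the question's "hello world" example suggests, a variant along these lines might be used. The helper name `words_between` is made up for illustration.

```python
import re

def words_between(text, pattern=r'foo\s(.*?)\sbaz'):
    """Return the words captured by `pattern`, or all words if it does not match."""
    m = re.search(pattern, text, re.I)
    # Fall back to the whole string when the delimiters are absent
    source = m.group(1) if m is not None else text
    return source.split()

print(words_between("foo hello world baz 33"))  # ['hello', 'world']
print(words_between("hello world"))             # ['hello', 'world']
```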

Why the ? in the pattern:
Suppose there are multiple occurrences of the word baz:

my_str =  "foo hello world baz 33 there is another baz"

Using pattern = r'foo\s(.*)\sbaz' matches greedily, so the group captures the longest possible text:

'hello world baz 33 there is another'

whereas using pattern = r'foo\s(.*?)\sbaz' returns the shortest match:

'hello world'
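
The greedy and lazy variants can be compared side by side with a short script. This is just a sketch that reuses the example string above; the variable names `greedy` and `lazy` are illustrative.

```python
import re

my_str = "foo hello world baz 33 there is another baz"

# Greedy: .* grabs as much as possible, so it runs up to the LAST " baz"
greedy = re.search(r'foo\s(.*)\sbaz', my_str).group(1)

# Lazy: .*? grabs as little as possible, so it stops at the FIRST " baz"
lazy = re.search(r'foo\s(.*?)\sbaz', my_str).group(1)

print(greedy)  # hello world baz 33 there is another
print(lazy)    # hello world
```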

edited Oct 30 '15 at 04:52

answered Oct 29 '15 at 14:17

Hypothetical Ninja


The non-capturing groups are unnecessary and should be removed. Other than that, this is probably the best solution. Maybe add something to account for the `33` at the end as well. – Tim Pietzcker Oct 29 '15 at 14:19
The OP asked for the match between foo and baz in the comments of the question. Thanks for your feedback, will do the changes :) – Hypothetical Ninja Oct 29 '15 at 14:20
Yeah, specs are a bit hazy. :) – Tim Pietzcker Oct 29 '15 at 14:21
Ideally the pattern should match without assuming foo and/or baz will be there. So 'hello world' is a possible string and should return ('hello', 'world'). I wasn't clear about this in the OP, will clarify – Neil Oct 29 '15 at 14:35
@Sword is '?' required? Without it, the case I just mentioned would be supported – Neil Oct 29 '15 at 14:40
just try with this my_str = "foo hello world baz 33 there is another baz" (with and without question mark) – Hypothetical Ninja Oct 29 '15 at 14:41
does your new edit mean that "return the entire string if foo baz aren't present"? – Hypothetical Ninja Oct 29 '15 at 14:59
@Sword "foo hello world baz 33 there is another baz" doesn't match my re so there is no difference with and without the '?'. An explanation on the difference would help clarify. Also, you understood the new edit correctly – Neil Oct 30 '15 at 04:41
I have edited my answer to return the entire string, give me some time, I'll explain the difference in the answer itself. – Hypothetical Ninja Oct 30 '15 at 04:44

score 3 · Answer 2 · answered Oct 29 '15 at 14:23

[This is not a solution, but I'll try to explain why it is not possible]

What you're after is something like this:

foo\s(\w+\s)+baz\s(\d+)

The cool part would be (\w+\s)+, which would repeat the capturing group. The problem is that most regex flavors store only the last match in that capturing group; earlier captures are overwritten.
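
The overwriting behaviour can be seen directly in Python's `re` module. A small sketch using the pattern above:

```python
import re

m = re.match(r'foo\s(\w+\s)+baz\s(\d+)', "foo hello world baz 33")

# The repeated group matched "hello " and then "world ",
# but only the final repetition is kept in group 1
print(m.groups())  # ('world ', '33')
```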

I recommend looping over the string with a simpler regex.
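
One way to read that advice, sketched under the assumption that each word is a run of `\w` characters: isolate the span between foo and baz with one coarse match, then iterate over it with `re.finditer`. The variable names are illustrative only.

```python
import re

text = "foo hello world baz 33"

# Step 1: one coarse match to isolate the middle part and the number
m = re.match(r'foo\s(.*)\sbaz\s(\d+)', text)

# Step 2: loop over the middle part with a simpler pattern, one word per match
words = [w.group(0) for w in re.finditer(r'\w+', m.group(1))]

print(words + [m.group(2)])  # ['hello', 'world', '33']
```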

Hope it helps

Sorin Negulescu


A related question with a similar answer: [link](http://stackoverflow.com/questions/3537878/how-to-capture-an-arbitrary-number-of-groups-in-javascript-regexp) – Sorin Negulescu Oct 29 '15 at 14:25

score 0 · Answer 3 · answered Oct 29 '15 at 15:00

Use index to find foo and baz, then split the substring between them.

def find_between(s, first, last):
    """Return the whitespace-separated words between first and last."""
    try:
        start = s.index(first) + len(first)
        end = s.index(last, start)
        return s[start:end].split()
    except ValueError:
        # str.index raises ValueError when a delimiter is missing
        return []

s = "foo hello world baz 33"
start = "foo"
end = "baz"
print(find_between(s, start, end))
