Capturing multiple possibilities with regex

Question

I'm trying to pull multiple names from a listing of names on reddit, i.e.

"Title: /u/foo, /u/bar"
"Title - /u/foo and /u/bar"
"title-/u/foo, /u/bar and /u/foobar"
"Title /u/barfoo (/u/foo and /u/bar)"

and I'm having trouble matching an arbitrary number of names between 1 and maybe 100.

Edit: I don't think I made it clear that the example strings I gave are small snippets of the actual text I'm searching. I'm checking the bodies of posts in /r/KarmaCourt, like these:

http://www.reddit.com/r/KarmaCourt/comments/1ifz0u/
http://www.reddit.com/r/KarmaCourt/comments/28hv73/

The question is about how to structure a regex; I'm not asking how to search the sample strings I gave for the names.

I know that r'title.*/u/(\w{3,20})' will match the last name in the line, r'title.*?/u/(\w{3,20})' will match the first in the line, and that I could manually add some number of r'.*?/?u?/?(\w{3,20})?' at the end of the expression to match more names, but I can't help thinking that's a bad way of doing it.

Would it be better to take the matching string from r'title.*?(?=/u/\w{3,20})(.*)' and pull all the matching r'/u/(\w{3,20})' groups from that, or is there a way to do this all in one step that I'm fundamentally missing?
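
For reference, a minimal sketch of the two-step approach described above, assuming Python's standard `re` module and the `{3,20}` form of the quantifier. The names `TAIL`, `NAME`, and `names_two_step` are made up for illustration.

```python
import re

# Step 1: capture everything from the first /u/ name after "title" to end of line.
TAIL = re.compile(r'title.*?(?=/u/\w{3,20})(.*)', re.IGNORECASE)
# Step 2: pull each individual name out of that captured tail.
NAME = re.compile(r'/u/(\w{3,20})')

def names_two_step(line):
    """Return the /u/ names that follow 'title' in a single line of text."""
    tail = TAIL.search(line)
    if tail is None:
        # No "title" followed by a name on this line.
        return []
    return NAME.findall(tail.group(1))

print(names_two_step("title-/u/foo, /u/bar and /u/foobar"))
```

The first pattern only has to succeed once per line; the second is the one that repeats, which is why this needs two passes.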

Note: this project is being done in Python, but this is more of a fundamentals question.

score 1 · Accepted Answer · 2015-01-08T23:15:51.973

You could use the \G construct if Python supports it.
\G means start search at end of the last match.

This basically lets you qualify the start of a new search (Title in this case)
without actually having to check each time.

Then just do a global search. The name is in group 1 after each match.
I set the multiline modifier. You may not need that if you are testing 1 line at a time.

 # (?mi)(?:(?!\A)\G|^Title).*?/u/(\w{3,20})

 (?xmi-)                       # Inline modifier = 
                               # expanded, multiline, case insensitive
 (?:
      (?! \A )                      # Not beginning of string
      \G                            # If matched before, start at end of last match
   |                              # or,
      ^ Title                       # BOL then 'title'
 )
 .*?                           # any characters, non-greedy
 /u/                           # until '/u/'
 ( \w{3,20} )                  # (1), then 3 to 20 word characters
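
Python's built-in `re` module has no `\G`, so the pattern above will not compile there. The sketch below is one possible workaround, not the method used in this answer: it emulates `\G` by calling `Pattern.match(text, pos)`, which only succeeds at exactly `pos`. The names `TITLE`, `NAME`, and `names_after_title` are made up for illustration.

```python
import re

# Start of a chain: a line that begins with "title" (case-insensitive).
TITLE = re.compile(r'^title', re.IGNORECASE | re.MULTILINE)
# One link in the chain: any characters on the same line, then a /u/ name.
NAME = re.compile(r'.*?/u/(\w{3,20})')

def names_after_title(text):
    """Collect every /u/ name that follows a line-initial 'title' on that line."""
    names = []
    for start in TITLE.finditer(text):
        pos = start.end()
        while True:
            # match() is anchored at pos, which plays the role of \G.
            m = NAME.match(text, pos)
            if m is None:
                break  # '.' does not cross newlines, so the chain ends with the line
            names.append(m.group(1))
            pos = m.end()
    return names

sample = "\n".join([
    "Title: /u/foo, /u/bar",
    "Title - /u/foo and /u/bar",
    "title-/u/foo, /u/bar and /u/foobar",
    "Title /u/barfoo (/u/foo and /u/bar)",
])
print(names_after_title(sample))
```

The third-party `regex` module mentioned in the comments supports `\G` directly, in which case the single pattern can be passed to `regex.findall` without the loop.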

Addendum
Here is the output, which should give an idea of how it works.

Output

 **  Grp 0 -  ( pos 0 , len 13 ) 
Title: /u/foo  
 **  Grp 1 -  ( pos 10 , len 3 ) 
foo  

------------

 **  Grp 0 -  ( pos 13 , len 8 ) 
, /u/bar  
 **  Grp 1 -  ( pos 18 , len 3 ) 
bar  

------------

 **  Grp 0 -  ( pos 24 , len 14 ) 
Title - /u/foo  
 **  Grp 1 -  ( pos 35 , len 3 ) 
foo  

------------

 **  Grp 0 -  ( pos 38 , len 11 ) 
 and /u/bar  
 **  Grp 1 -  ( pos 46 , len 3 ) 
bar  

------------

 **  Grp 0 -  ( pos 52 , len 12 ) 
title-/u/foo  
 **  Grp 1 -  ( pos 61 , len 3 ) 
foo  

------------

 **  Grp 0 -  ( pos 64 , len 8 ) 
, /u/bar  
 **  Grp 1 -  ( pos 69 , len 3 ) 
bar  

------------

 **  Grp 0 -  ( pos 72 , len 14 ) 
 and /u/foobar  
 **  Grp 1 -  ( pos 80 , len 6 ) 
foobar  

------------

 **  Grp 0 -  ( pos 89 , len 15 ) 
Title /u/barfoo  
 **  Grp 1 -  ( pos 98 , len 6 ) 
barfoo  

------------

 **  Grp 0 -  ( pos 104 , len 8 ) 
 (/u/foo  
 **  Grp 1 -  ( pos 109 , len 3 ) 
foo  

------------

 **  Grp 0 -  ( pos 112 , len 11 ) 
 and /u/bar  
 **  Grp 1 -  ( pos 120 , len 3 ) 
bar

edited Jan 08 '15 at 23:15

answered Jan 08 '15 at 23:03

If I used this on my non-greedy example, `r'title.*?/u/(\w{3,20})'`, then wouldn't the next match look for another "title" before the next name? This would still end up being 2 steps, similar to my proposed solution above, right? – Humus Jan 08 '15 at 23:13
@Humus - It's a single step. I added some output. Just find all, and it will get all these values in one shot. – Jan 08 '15 at 23:17
Thanks! I think I understand this now. Unfortunately, python doesn't have the \G construct, but there's apparently a way to work around that from here: http://stackoverflow.com/questions/529830/do-python-regexes-support-something-like-perls-g – Humus Jan 08 '15 at 23:20
@Humus - I think the newer Pythons support the `\G` anchor, but not sure. – Jan 08 '15 at 23:22
Huh. Just threw it into some code and it seems to be working. Thanks again! – Humus Jan 08 '15 at 23:24
@Humus - Must be using the new _regex_ module. Have a look at this link https://pypi.python.org/pypi/regex – Jan 08 '15 at 23:29
This only seems to be getting me the last match in the series. Mind if I email you the details with code and full test case? – Humus Jan 09 '15 at 00:17
@Humus - I'm not really a Python guru (mostly read only). If you could post it as a sticky link from an on-line tester (or pastebin), I could take a look at it. Do you think it's the regex or the functions you are using? Sounds like `\G` is not available. Look at that link I posted, try to import the new `regex` module. – Jan 09 '15 at 00:41
Never mind. I found it was an error in my code, not in the regex. While processing the names to lowercase, I managed to accidentally write each name into the same location, overwriting the previous one. – Humus Jan 09 '15 at 01:08
@Humus - Ok. Are you using something like `regex.findall(r'(?mi)(?:(?!\A)\G|^Title).*?/u/(\w{3,20})', data)` , via `import regex` or is it already in the latest Python ? – Jan 09 '15 at 01:11
That's almost exactly what I'm doing, but I'm not including the caret in front of "title", because leaving it out lets me match more cases where the formatting isn't exact. The regex is working wonderfully now. Thanks for all your help! – Humus Jan 09 '15 at 01:24

score 0 · Answer 2 · answered Jan 08 '15 at 22:53

Here's how you can do it in Python. findall returns a list of every match in the string; because the pattern has one capturing group, each item is just the captured username. Once you have that list, you can iterate over it and get the usernames.

import re

s = ["Title: /u/foo, /u/bar",
     "Title - /u/foo and /u/bar",
     "title-/u/foo, /u/bar and /u/foobar",
     "Title /u/barfoo (/u/foo and /u/bar)"]

for t in s:
    matches = re.findall(r'/u/(\w+)', t)
    print(matches)

Mazdak · Answer 3 · 2015-01-08T23:09:48.687

Really, you don't need regex; you can just use str.split() and str.rstrip():

>>> l=["Title: /u/foo, /u/bar",
... "Title - /u/foo and /u/bar",
... "title-/u/foo, /u/bar and /u/foobar",
... "Title /u/barfoo (/u/foo and /u/bar)"]
>>> s=[i.split() for i in l]
>>> [[j.split('/u/')[1].rstrip('),') for j in i if '/u/' in j] for i in s]
[['foo', 'bar'], ['foo', 'bar'], ['foo', 'bar', 'foobar'], ['barfoo', 'foo', 'bar']]

And if you want to use regex, you can just use a positive look-behind:

>>> import re
>>> s=[re.findall(r'(?<=/u/)\w+',i) for i in l]
>>> s
[['foo', 'bar'], ['foo', 'bar'], ['foo', 'bar', 'foobar'], ['barfoo', 'foo', 'bar']]

edited Jan 08 '15 at 23:09

answered Jan 08 '15 at 23:01

Unfortunately, that assumes that the examples I gave you are the entirety of the text. I'm pulling these from the bodies of /r/KarmaCourt posts, where the text looks more like a full court docket. I'm looking for an answer using regexes, because I do in fact need them to look through something like this: http://www.reddit.com/r/KarmaCourt/comments/1ifz0u/the_people_of_reddit_vs_uvolumezero_for_blatant/ – Humus Jan 08 '15 at 23:05
@Humus OK, check out the edit. I think you can use a positive look-behind – Mazdak Jan 08 '15 at 23:10
Yes, a positive lookbehind would give me all the names listed in the post, but I specifically want to classify them by their "title", which is why I'm bothering with searching for the title in the first place. What I'm looking for is a way to pull an arbitrary number of matches from a regex that contains other match elements that only need to be matched once. – Humus Jan 08 '15 at 23:15
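
To illustrate the requirement in that last comment, here is a rough sketch of classifying names by the label that precedes them. It assumes a hypothetical layout in which each line opens with a one-word label followed by its names; real post bodies may well differ. It uses only the standard `re` module, and `LINE`, `NAME`, and `names_by_label` are made-up names.

```python
import re
from collections import defaultdict

# Hypothetical layout: one leading word per line acts as the label.
LINE = re.compile(r'^(\w+)\b(.*)$', re.MULTILINE)
NAME = re.compile(r'/u/(\w{3,20})')

def names_by_label(text):
    """Map each lower-cased line label to the /u/ names listed after it."""
    grouped = defaultdict(list)
    for label, rest in LINE.findall(text):
        found = NAME.findall(rest)
        if found:  # ignore lines that carry no names
            grouped[label.lower()].extend(found)
    return dict(grouped)

# Made-up input, for illustration only.
docket = "Judge: /u/foo\nDefence - /u/bar and /u/foobar\nNothing here"
print(names_by_label(docket))
```

The label is matched once per line and the name pattern is applied repeatedly to the remainder, so the number of names per label is not fixed in advance.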
