Regular Expressions: Find Names in String using Python

Question

I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.

This is my string:

<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'

I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.

I have tried "\s([A-Za-z0-9]{1}.+?)," and other alterations of that regex to get the data I want but I have had no success. Any help is appreciated.

The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).

Here is another string as an example:

<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>

Will these names `, Soko, Jacob Escobedo` always come after a `</a>`? — Caio Oliveira, Jun 06 '14 at 22:00
Why those names and not the others? You aren't explaining your rules very clearly. — jonrsharpe, Jun 06 '14 at 22:02
You need to provide more information for this problem. There needs to be some case that makes Soko and Jacob Escobedo special here... like if they always came after a `</a>` as @CaioOliveira suggested — wallacer, Jun 06 '14 at 22:03
It is still not clear where these names are in a string. Is it an HTML page you are parsing? Does `Carson Daly` text required before the names? Is `(R 2/28/14)` part of a text relevant? — alecxe, Jun 06 '14 at 22:08
Apologies for not being clear. I have a regex pattern that can extract the names between the a tags, but when the a tags do not exist I run into trouble, and (R 'date') may or may not exist. If it does not exist then the `<br>` tag will follow. — Jake DeVries, Jun 06 '14 at 22:09
@user3654089 thanks, could you also provide more examples that "cover" other cases you've described? — alecxe, Jun 06 '14 at 22:10
I think you should give us some more examples *(and its respective desirable results)* — Caio Oliveira, Jun 06 '14 at 22:10
The string itself contains the name of a late night host and the guests that appear on the program for the evening. I want to extract the guest names individually. When the guest falls between an a tag because there is a product or link relevant to the guest, the guest can be easily extracted but when the guest has no link pertinent, things get difficult. Here is another string as an example. Carson Daly: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh — Jake DeVries, Jun 06 '14 at 22:16
Use a regex to strip out the `` tags first then. All your examples will be the same then, no? Replace `,?\s*` with an empty string.. — mpen, Jun 06 '14 at 22:22
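The last comment's idea (remove the tags first, then treat every line the same way) can be sketched with the standard library alone. The pattern in that comment did not survive intact, so the `<[^>]+>` tag pattern, the helper name `guests_after_stripping_tags`, and the stripping of a stray trailing quote are all assumptions made for illustration, not the commenter's exact method.

```python
import re

# Assumption: a simple tag pattern is enough for these short, well-formed lines.
TAG = re.compile(r'<[^>]+>')

def guests_after_stripping_tags(line):
    """Drop all tags, discard the host before the first colon, split on commas."""
    text = TAG.sub('', line)
    _host, _sep, rest = text.partition(':')
    # Strip spaces and the stray quote that ends the first sample string.
    parts = (part.strip(" '") for part in rest.split(','))
    return [part for part in parts if part]

samples = [
    """<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>""",
]

for sample in samples:
    print(guests_after_stripping_tags(sample))
```

This returns the linked guest as well as the unlinked ones, and it leaves the `(R 2/28/14)` annotation attached to the last name.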

alecxe · Accepted Answer · 2014-06-06T23:04:12.100

2

An alternative approach would be to parse the string with an HTML parser, like lxml.

For example, you can use an XPath expression to find everything between the b tag with the text Carson Daly and the br tag by checking preceding and following siblings:

from lxml.html import fromstring

l = [
    """<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

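for html in l:
    tree = fromstring(html)
    results = ''
    # Select every node (element or text) between the <b> host tag and the <br>.
    for element in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if not isinstance(element, str):
            # An element such as <a>: keep its text (empty if the tag has none).
            results += (element.text or '').strip()
        else:
            # A bare text node: drop the colon that follows the host name.
            text = element.strip(':')
            if text:
                results += text.strip()

    print(results.split(', '))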

It prints:

['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
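In the first result the last entry still carries the `(R 2/28/14)` annotation, which the question says may or may not be present. A minimal sketch for removing it afterwards, assuming the annotation always has the form `(R ...)` at the end of a name; the names `ANNOTATION_SUFFIX` and `strip_annotation` are hypothetical:

```python
import re

# Assumption: the annotation, when present, is a trailing "(R <something>)" group.
ANNOTATION_SUFFIX = re.compile(r'\s*\(R\s+[^)]*\)\s*$')

def strip_annotation(name):
    """Return the name without a trailing "(R ...)" annotation, if there is one."""
    return ANNOTATION_SUFFIX.sub('', name).strip()

guests = ['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
print([strip_annotation(guest) for guest in guests])
# ['Ben Schwartz', 'Soko', 'Jacob Escobedo']
```

Names without the annotation pass through unchanged, so the same helper can be applied to every entry.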

edited Jun 06 '14 at 23:04

answered Jun 06 '14 at 22:01


I am not very familiar with lxml or Element Tree which I have heard lxml is derived from? Or used somewhat in conjunction with? Anyways, would you mind explaining to me what is going on in the xpath argument? Or how element.text.strip is able to only strip what is between tags and what is not? – Jake DeVries Jun 09 '14 at 20:05
@user3654089 sure. `lxml` is just a third-party module that follows the `ElementTree API`. But it is different from the `ElementTree` from the stdlib in many ways, like it is very fast, the `xpath` support is more complete etc. The string under the `xpath` method call is an `xpath` expression - it is basically a language for navigating and searching in the XML/HTML tree. – alecxe Jun 09 '14 at 20:08
@user3654089 this particular xpath finds everything between the `b` tag with `Carson Daly` text and a `br` tag. Hope that helps in understanding. – alecxe Jun 09 '14 at 20:08
That helps, thanks. And this is a much better method for parsing the strings than using regular expressions. I appreciate the help. – Jake DeVries Jun 09 '14 at 20:15
If "Carson Daly" were to change to "David Letterman" or "Jimmy Kimmel" for instance how can I make the xpath flexible in some sort of loop? Like ::b=+latenighthostlist+ and following.... – Jake DeVries Jun 10 '14 at 16:25
@user3654089 could you please elaborate it to a separate SO question? Thanks. – alecxe Jun 10 '14 at 17:35

score 1 · Answer 2 · edited May 23 '17 at 12:27

If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)

<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))

Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.

This is an application of this question about matching a pattern except in certain situations (read that for implementation details including links to Python code).
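As a rough illustration of reading the Group 1 captures in Python, the pattern above can be run with `re.finditer`, keeping only the matches where Group 1 took part. The helper name `unlinked_guests` and the `samples` list are made up for this sketch; the two strings are the ones from the question.

```python
import re

# The pattern from this answer: the first two alternatives consume tags
# (and any text inside a tag pair); only the last one captures into Group 1.
PATTERN = re.compile(r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))')

def unlinked_guests(line):
    """Return the Group 1 captures, skipping matches that only consumed tags."""
    return [m.group(1) for m in PATTERN.finditer(line) if m.group(1)]

samples = [
    """<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>""",
]

for sample in samples:
    print(unlinked_guests(sample))
# ['Soko', 'Jacob Escobedo']
# ['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
```

Because the capturing alternative requires at least two word characters, a one-letter name would be skipped.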
