I have never had a very hard time with regular expressions up until now. I am hoping the solution is not obvious because I have probably spent a few hours on this problem.

This is my string:

<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'

I want to extract 'Soko' and 'Jacob Escobedo' as individual strings. If it takes two different patterns for the extractions, that is okay with me.

I have tried "\s([A-Za-z0-9]{1}.+?)," and other variations of that regex to get the data I want, but I have had no success. Any help is appreciated.

The names never follow the same tag or the same symbol. The only thing that consistently precedes the names is a space (\s).

Here is another string as an example:

<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>
alecxe
Jake DeVries

2 Answers

An alternative approach would be to parse the string with an HTML parser, like lxml.

For example, you can use an XPath expression to select everything between the b tag containing the text Carson Daly and the br tag by checking each node's preceding and following siblings:

from lxml.html import fromstring

# The two sample strings from the question
samples = [
    """<b>Carson Daly</b>: <a href="http://rads.stackoverflow.com/amzn/click/B009DA74O8">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'""",
    """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""
]

for html in samples:
    tree = fromstring(html)
    results = ''
    # Select every node (elements and text) between <b>Carson Daly</b> and <br>
    for node in tree.xpath('//node()[preceding-sibling::b="Carson Daly" and following-sibling::br]'):
        if isinstance(node, str):
            # Text node: drop the colon that follows the host name
            results += node.strip(':').strip()
        else:
            # Element (such as the <a> link): take its text
            results += (node.text or '').strip()

    # Python 3 print function
    print(results.split(', '))

It prints:

['Ben Schwartz', 'Soko', 'Jacob Escobedo (R 2/28/14)']
['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']
alecxe
  • I am not very familiar with lxml or Element Tree which I have heard lxml is derived from? Or used somewhat in conjunction with? Anyways, would you mind explaining to me what is going on in the xpath argument? Or how element.text.strip is able to only strip what is between tags and what is not? – Jake DeVries Jun 09 '14 at 20:05
  • @user3654089 sure. `lxml` is just a third-party module that follows the `ElementTree API`. But it is different from the `ElementTree` from the stdlib in many ways, like it is very fast, the `xpath` support is more complete etc. The string under the `xpath` method call is an `xpath` expression - it is basically a language for navigating and searching in the XML/HTML tree. – alecxe Jun 09 '14 at 20:08
  • @user3654089 this particular xpath finds everything between the `b` tag with `Carson Daly` text and a `br` tag. Hope that helps in understanding. – alecxe Jun 09 '14 at 20:08
  • That helps, thanks. And this is a much better method for parsing the strings than using regular expressions. I appreciate the help. – Jake DeVries Jun 09 '14 at 20:15
  • If "Carson Daly" were to change to "David Letterman" or "Jimmy Kimmel" for instance how can I make the xpath flexible in some sort of loop? Like ::b=+latenighthostlist+ and following.... – Jake DeVries Jun 10 '14 at 16:25
  • @user3654089 could you please elaborate it to a separate SO question? Thanks. – alecxe Jun 10 '14 at 17:35

If you want to do it in regex (and with all the disclaimers on that topic), the following regex works with your strings. However, do note that you need to retrieve your matches from capture Group 1. In the online demo, make sure you look at the Group 1 captures in the bottom right pane. :)

<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))

Basically, with the left alternations (separated by |) we match everything we don't want, then the final parentheses on the right capture what we do want.

This applies the technique from the question about matching a pattern except in certain situations (read it for implementation details, including links to Python code).
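
As a minimal sketch of how the Group 1 retrieval might look in Python (the helper name extract_names is hypothetical and not taken from the linked question), one could compile the regex above and keep only the matches in which Group 1 took part:

```python
import re

# The regex from this answer: the first two alternatives consume tags
# (and anything wrapped in them); only the last alternative is captured.
PATTERN = re.compile(
    r'<[^<]*</[^>]*>|<.*?>|((?<=,\s)\w[\w ]*\w|\w[\w ]*\w(?=,))'
)


def extract_names(html):
    """Return the Group 1 captures, skipping matches that only consumed tags."""
    return [m.group(1) for m in PATTERN.finditer(html) if m.group(1)]


# The two sample strings from the question
first = """<b>Carson Daly</b>: <a href="https://rads.stackoverflow.com/amzn/click/com/B009DA74O8" rel="nofollow noreferrer">Ben Schwartz</a>, Soko, Jacob Escobedo (R 2/28/14)<br>'"""
second = """<b>Carson Daly</b>: Wil Wheaton, the Birds of Satan, Courtney Kemp Agboh<br>"""

print(extract_names(first))
print(extract_names(second))
```

On the two sample strings this should give ['Soko', 'Jacob Escobedo'] and ['Wil Wheaton', 'the Birds of Satan', 'Courtney Kemp Agboh']. 'Ben Schwartz' is swallowed by the first alternative because it sits inside the a tag, so it never reaches Group 1.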

zx81