How do I select a piece of text with regex and then use findall?

Question

I have the following HTML. How can I use regex to select name2 and name3, which appear under the <h3 class="suite-title">author</h3> heading?

<li class="suite-item">
                            <h3 class="suite-title">thinker</h3>
                            <ul class="suite-title-list">
                                                                    <li class="suite-title">
                                        <a href="https://www.sitename.org/profile/name1/fl/suite" title="name1"class="suite-name">name1</a>
                                        </li>
                                                            </ul>
                        </li>
                                            <li class="suite-item">
                            <h3 class="suite-title">author</h3>
                            <ul class="suite-title-list">
                                                                    <li class="suite-title">
                                        <a href="https://www.sitename.org/profile/name2/fl/suite" title="name2"class="suite-name">name2</a>
                                        </li>
                                                                    <li class="suite-title">
                                        <a href="https://www.sitename.org/profile/name3/fl/suite" title="name3"class="suite-name">name3</a>
                                        </li>
                                                            </ul>

I managed to write the following code, but it only returns name2:

    re.findall(
        r'<h3 class=\"suite-title\">author</h3>\s+<ul class=\"suite-title-list\">\s+<li class=\"suite-title\">\s+<a href.*\">(.*?)</a>',
        string=content.text,
    )

How about using an HTML parser like BeautifulSoup? – Klaus D., Jan 18 '20 at 06:04

spacecowboy · Accepted Answer · 2020-01-18T09:03:39.580

You could find a way to do this with a single global regex, but it is probably better to split the HTML into a list of lines and process each line, setting a flag while you are inside the author section and using re to capture the names.

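    import re

    # html holds the page source, e.g. content.text from the question.
    in_author = False  # True while we are inside the "author" section
    authors = []
    for line in html.split('\n'):
        # Every <h3> starts a new section; check whether it is the author one.
        if '<h3' in line:
            in_author = 'author</h3>' in line
        # Inside the author section, capture the text of each link.
        if in_author and '<a' in line:
            match = re.search(r'<a[^>]*>(.*?)</a>', line)
            if match:
                authors.append(match.group(1))
    print(authors)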
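
For comparison, the global-regex route can be sketched as two steps: first capture the list that follows the author heading, then run findall only on that captured block. This is a minimal sketch, assuming the markup keeps the shape shown in the question; `names_under_heading` and the trimmed `html` sample are hypothetical names used for illustration.

```python
import re

# Trimmed sample standing in for content.text from the question.
html = """
<li class="suite-item">
    <h3 class="suite-title">thinker</h3>
    <ul class="suite-title-list">
        <li class="suite-title">
            <a href="https://www.sitename.org/profile/name1/fl/suite" title="name1"class="suite-name">name1</a>
        </li>
    </ul>
</li>
<li class="suite-item">
    <h3 class="suite-title">author</h3>
    <ul class="suite-title-list">
        <li class="suite-title">
            <a href="https://www.sitename.org/profile/name2/fl/suite" title="name2"class="suite-name">name2</a>
        </li>
        <li class="suite-title">
            <a href="https://www.sitename.org/profile/name3/fl/suite" title="name3"class="suite-name">name3</a>
        </li>
    </ul>
</li>
"""


def names_under_heading(html, heading):
    """Return the link texts from the <ul> that follows <h3 ...>heading</h3>."""
    # Step 1: isolate the <ul>...</ul> block right after the wanted heading.
    block = re.search(
        r'<h3 class="suite-title">' + re.escape(heading) + r'</h3>\s*<ul[^>]*>(.*?)</ul>',
        html,
        flags=re.DOTALL,
    )
    if block is None:
        return []
    # Step 2: run findall on that block only, so other sections are ignored.
    return re.findall(r'<a\b[^>]*>(.*?)</a>', block.group(1), flags=re.DOTALL)


print(names_under_heading(html, "author"))  # ['name2', 'name3']
```

The original attempt returned only name2 because a single pattern with one capture group anchored to the heading can match the heading just once. Narrowing the text first and then calling findall avoids that, though it still breaks if the page nests another list inside the section.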

As mentioned in the comments, using an HTML parser is probably a better way to go:

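    from bs4 import BeautifulSoup

    authors = []
    soup = BeautifulSoup(html, 'html.parser')
    for link in soup.find_all('a'):
        # Look back to the closest preceding <h3> to see which section the link is in.
        heading = link.find_previous('h3')
        if heading is not None and heading.get_text(strip=True) == 'author':
            authors.append(link.get_text(strip=True))
    print(authors)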

While any HTML parsing tool comes with a bit of a learning curve, it is well worth the effort if you plan on doing more processing of structured markup in the future.
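
If installing a third-party package is not an option, the standard library's html.parser module is another such tool. The following is a minimal sketch of the same flag-based idea built on it, assuming each section starts with an h3 heading as in the question; `HeadingLinkCollector` and the simplified `sample` markup are hypothetical and only for illustration.

```python
from html.parser import HTMLParser


class HeadingLinkCollector(HTMLParser):
    """Collect the text of <a> tags that follow an <h3> with the wanted text."""

    def __init__(self, heading):
        super().__init__()
        self.heading = heading
        self.in_h3 = False       # currently inside an <h3> element
        self.in_section = False  # the most recent <h3> matched the heading
        self.in_link = False     # currently inside an <a> we want to keep
        self.names = []

    def handle_starttag(self, tag, attrs):
        if tag == "h3":
            self.in_h3 = True
        elif tag == "a" and self.in_section:
            self.in_link = True
            self.names.append("")

    def handle_endtag(self, tag):
        if tag == "h3":
            self.in_h3 = False
        elif tag == "a":
            self.in_link = False

    def handle_data(self, data):
        if self.in_h3:
            # Each heading switches the section flag on or off.
            self.in_section = data.strip() == self.heading
        elif self.in_link:
            self.names[-1] += data


# Simplified sample modelled on the markup in the question.
sample = """
<h3 class="suite-title">thinker</h3>
<ul><li><a href="/profile/name1" class="suite-name">name1</a></li></ul>
<h3 class="suite-title">author</h3>
<ul>
  <li><a href="/profile/name2" class="suite-name">name2</a></li>
  <li><a href="/profile/name3" class="suite-name">name3</a></li>
</ul>
"""

collector = HeadingLinkCollector("author")
collector.feed(sample)
collector.close()
print(collector.names)  # ['name2', 'name3']
```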
