How to write a regex that pulls unordered list and paragraph preceding it

Question

I have a beautiful soup object, which I've converted to a string and I want to pull all instances of bulleted lists and the paragraph immediately preceding them. An example is the following string:

...
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
...

I use the following regex:

re.findall('<p>.*</p>\n<ul>.*</ul>', string)

However, it's returning an empty list. What's the best way to do this?

I deleted my first comment because I decided to go Google your problem. I found https://docs.python.org/2/library/re.html#finding-all-adverbs-and-their-positions. I noted that there is an "r" before the regexp. Maybe you just missed that? — Mark Manning, Jan 31 '16 at 04:47
The `.*` is likely gobbling up the opening `<` of the closing tag. I'd try `[^<]*` instead. Also, there may be more whitespace than a single `\n`. This is not the approach I'd take; I think any approach like this would be fragile. Your "string" looks like parsable XML, so you might consider using XSLT to grab what you want. — mbmast, Jan 31 '16 at 04:49
Wait. You parse the HTML properly with BeautifulSoup, and then you *un*parse it and want to use regexes to get the data out? Use the BeautifulSoup object directly! Don't try to parse HTML with regexes. — user2357112, Jan 31 '16 at 05:26
([Obligatory.](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)) — user2357112, Jan 31 '16 at 05:27
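
To illustrate why the pattern in the question comes back empty, here is a minimal sketch. The `html` string is a shortened stand-in for the asker's markup, and `original` and `relaxed` are names made up for this example. By default `.` does not match a newline, and `\n<ul>` leaves no room for the indentation in front of `<ul>`.

```python
import re

# Shortened stand-in for the markup in the question (hypothetical sample data).
html = (
    "<p>Consider rebranding if:</p>\n"
    "    <ul>\n"
    "    <li>Management change</li>\n"
    "    </ul>\n"
)

# The pattern from the question: '.' stops at newlines, and '\n<ul>' does not
# allow for the spaces that precede <ul>, so nothing matches.
original = re.findall(r'<p>.*</p>\n<ul>.*</ul>', html)

# re.DOTALL lets '.' cross newlines, '\s*' absorbs the whitespace between the
# tags, and the non-greedy '.*?' stops at the first closing tag it can use.
relaxed = re.findall(r'<p>.*?</p>\s*<ul>.*?</ul>', html, flags=re.DOTALL)

print(original)
print(relaxed)
```

Even the relaxed pattern is fragile: on a page where an earlier `<p>` is not followed by a list, the match can start at that earlier paragraph and swallow everything up to the next list, which is one reason the comments above recommend working with the parsed tree instead.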

score 1 · Accepted Answer · answered Jan 31 '16 at 05:48

Don't use regular expressions to parse HTML!

BeautifulSoup can do everything you want easily, elegantly and correctly:

>>> import bs4
>>> soup = bs4.BeautifulSoup(r"""
    <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
    <ul>
    <li>You are experiencing a decrease in sales and customers</li>
    <li>If your brand design does not reflect what you deliver</li>
    <li>If you want to attract a new target audience</li>
    <li>Management change</li>
    <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
    </ul>
""")
>>> bulleted_lists = soup.find_all('ul')
>>> uls_with_ps = [(ul.find_previous('p'), ul) for ul in bulleted_lists]

To get a feel for what's going on:

>>> bulleted_lists
[<ul>
<li>You are experiencing a decrease in sales and customers</li>
<li>If your brand design does not reflect what you deliver</li>
<li>If you want to attract a new target audience</li>
<li>Management change</li>
<li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
</ul>]

>>> bulleted_lists[0].find_previous('p')
<p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
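
One caveat: the "previous `<p>`" lookup returns the nearest earlier `<p>` anywhere in the document, even if other elements sit between it and the list. If only a paragraph *immediately* before the list should count, a stricter check along these lines might help. This is a sketch on made-up sample markup; `immediately_preceding_p` and `strict_pairs` are hypothetical names, not part of BeautifulSoup.

```python
from bs4 import BeautifulSoup

# Made-up sample: the first <ul> follows a <div>, the second follows a <p>.
html = """
<p>Intro paragraph with no list after it.</p>
<div>Unrelated block</div>
<ul><li>orphan item</li></ul>
<p>Consider rebranding if:</p>
<ul>
<li>Management change</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

def immediately_preceding_p(ul):
    """Return the <p> directly before ul among its tag siblings, else None."""
    # Passing True matches any tag, so whitespace-only text nodes are skipped.
    prev = ul.find_previous_sibling(True)
    return prev if prev is not None and prev.name == "p" else None

pairs = [(immediately_preceding_p(ul), ul) for ul in soup.find_all("ul")]
strict_pairs = [(p, ul) for p, ul in pairs if p is not None]

# For contrast: the looser lookup still finds a <p> for the first list.
loose_first = soup.find_all("ul")[0].find_previous("p")
```

Here `strict_pairs` keeps only the second list, while `loose_first` is the intro paragraph even though a `<div>` separates it from the first list.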

When I pull `<ul>` tags from a page, there's lots of `<li>` tags in them that I want to filter out that contain either attributes or children tags under them, such as the last `<li>` tag with href in the string I provided. I'm using text.findAll(lambda tag: tag.name == 'ul' and not tag.attrs) to return `<ul>` tags. Is there a good way to modify that so that it also filters out any children `<li>` tags with attributes and with children under them? I just want to return the `<li>` string. — Mika Schiller, Jan 31 '16 at 22:37
That's a different question, for that you should open a separate SO question. — taleinat, Feb 01 '16 at 07:14

Learner · Answer 2 · 2016-01-31T06:41:30.020

Why use a regex when BeautifulSoup can handle this HTML directly? Try CSS selectors instead. The selector `div.Mother div.Son ul li` means: select all divs with class Mother, then all divs with class Son inside them, then the ul inside those, and finally every li inside that ul.

from bs4 import BeautifulSoup as bs

data = """

    <body>
    <div class="Mother" >
        <div class="Son" >
            <p><strong><strong> </strong></strong>It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:</p>
            <ul>
                <li>You are experiencing a decrease in sales and customers</li>
                <li>If your brand design does not reflect what you deliver</li>
                <li>If you want to attract a new target audience</li>
                <li>Management change</li>
                <li><a href="http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/" onclick="__gaTracker('send', 'event', 'outbound-article', 'http://www.risingabovethenoise.com/how-to-rebrand-19-questions-ask-before-you-start/', '19 Questions to Ask Yourself Before You Start Rebranding');">19 Questions to Ask Yourself Before You Start Rebranding</a></li>
            </ul>
        </div>
    </div>
</body>

"""

soup = bs(data, 'lxml')
# Grab all the text inside div.Son (the paragraph plus the list items)
for item in soup.select('div.Mother div.Son'):
    print(item.text.strip())
print("=" * 100)
# Grab just the li elements
for li in soup.select('div.Mother div.Son ul li'):
    print(li.text.strip())

Output:

It can be hard to admit that rebranding is necessary. Companies can often be attached to their brand, even if it is hurting their sales. Consider rebranding if:

You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
====================================================================================================
You are experiencing a decrease in sales and customers
If your brand design does not reflect what you deliver
If you want to attract a new target audience
Management change
19 Questions to Ask Yourself Before You Start Rebranding
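
The selector approach can also pair each list with its paragraph. The sketch below assumes a bs4 version whose `select()` supports the adjacent-sibling combinator `+`; the sample markup is a shortened version of the data above, and `results` is a name made up for this example. `p + ul` matches only a `<ul>` whose immediately preceding element is a `<p>`.

```python
from bs4 import BeautifulSoup

# Shortened version of the sample data (hypothetical, for illustration).
html = """
<div class="Son">
<p>Consider rebranding if:</p>
<ul>
<li>Management change</li>
<li>If you want to attract a new target audience</li>
</ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

results = []
# "p + ul" keeps only lists that directly follow a paragraph.
for ul in soup.select("div.Son p + ul"):
    p = ul.find_previous_sibling("p")  # the paragraph the selector matched on
    items = [li.get_text(strip=True) for li in ul.select("li")]
    results.append((p.get_text(strip=True), items))

for paragraph, items in results:
    print(paragraph)
    for text in items:
        print(" -", text)
```

Lists that are not directly preceded by a paragraph are simply not selected, so no extra filtering is needed.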
