Regular expression extract and exclude data from string

Question

I have an HTML-like string that I want to extract data from.

s="<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"

I want to extract the content of every <li> element that directly contains data, that is, elements with something like "this is a bullet lev 1 " between the tags, and not those that contain other <li> elements, as in multilevel lists such as

<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li>

I have written a regular expression for that

<li>([\w &;/<>]*?)</li>

however this ends up pulling the unwanted data as well

<li>this is a bullet lev 1&nbsp;</li>
<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li>
<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li>

while I want it to pull

<li>this is a bullet lev 1&nbsp;</li>
<li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li>
<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li>

The idea is that I want to exclude any results that already have <li> in the extracted data and move ahead.

From my research, I understand that I probably have to use a lookahead or a lookbehind. I gave it a couple of tries, but to no avail.

Any clues? I am using Python and its built-in re module.

Please read this http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags and this: http://stackoverflow.com/questions/717541/parsing-html-in-python — HVNSweeting, Apr 05 '13 at 16:38

score 0 · Answer 1 · answered Apr 05 '13 at 16:29


I've never used BeautifulSoup before, but I installed it and, without reading any documentation, had this within 15 minutes:

>>> s="<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> for liRaw in soup.findAll('li'):
...   if liRaw.findParent().findParent().name == u'[document]':
...     print liRaw.text
this is a bullet lev 1&nbsp;
&nbsp;thisis a bullet lev&nbsp;
&nbsp;this is a bullet lev 3

Hope this helps...

answered Apr 05 '13 at 16:29

Vorsprung


Hi there, what I gave as an example above is a correctly formatted example. I evaluated the parser idea, but there is a problem with it: the data I am receiving is not, strictly speaking, correctly formatted for HTML parsing, so I gave up on that idea. However, I understand that what I am doing is not efficient. The parser I tried was the default xml.dom.minidom, not BeautifulSoup. I will give that a try as a plan B. Thanks a lot for your suggestion. – Zach Apr 07 '13 at 14:53
It might be worth running a proper test with more realistic data. Most good HTML parsers can cope with a level of malformedness similar to what a real browser handles. For example, I just tried removing all the ```` tags from your example string ``s`` and it worked fine, with the same results. I'll certainly be bearing BeautifulSoup in mind for future projects. – Vorsprung Apr 07 '13 at 18:46

femtoRgon · Accepted Answer · 2013-04-05T16:39:29.977


I think this might do the job.

<li>((?!<li>).)*?</li>

This should match any <li> followed by </li>, with anything in between, as long as the text in between does not contain another <li>. The negative lookahead (?!<li>) is checked before each character is consumed.

This assumes that you don't actually want <li> <ul><li><ul><li>this is a bullet lev 3</li>, but rather: <li>this is a bullet lev 3</li>, in your examples, which seems more consistent with your description.
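As a rough illustration of the pattern above, here is a minimal sketch using Python's re module on the string from the question. Two details are assumptions rather than part of the answer: the capturing group is moved so that it wraps the whole repetition (with the group around a single character, re.findall would return only the last character of each match), and re.DOTALL is added in case real input contains newlines.

```python
import re

s = ("<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong>"
     " is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet"
     " lev 3</li></ul></li></ul></li></ul></ul><strong></li>")

# Consume one character at a time, but only at positions where "<li>" does
# not start, so a match can never span a nested <li>.
# Assumption: the group is widened to capture the whole inner text.
LEAF_LI = re.compile(r"<li>((?:(?!<li>).)*?)</li>", re.DOTALL)

leaf_items = LEAF_LI.findall(s)
for item in leaf_items:
    print(item)
# this is a bullet lev 1&nbsp;
# <strong>&nbsp;this</strong> is a bullet lev&nbsp;
# this is a bullet lev 3
```

Inner markup such as <strong> and entities such as &nbsp; are left in the captured text, so they would still need to be cleaned up separately.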

That said, a parser really would be a better idea for this sort of thing, generally speaking.
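For comparison, the parser route could look something like the sketch below. It uses the standard library's html.parser rather than any particular third-party package, and the class name LeafLiCollector and its attributes are hypothetical names chosen for illustration. It keeps the text of each <li> that has no nested <li>, and it simply ignores a stray closing tag like the one at the end of the example string.

```python
from html.parser import HTMLParser


class LeafLiCollector(HTMLParser):
    """Collect the text of <li> elements that contain no nested <li>."""

    def __init__(self):
        super().__init__()     # default convert_charrefs=True: &nbsp; arrives as "\xa0"
        self.open_items = []   # one [text_parts, has_nested_li] entry per open <li>
        self.leaf_texts = []

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self.open_items:
                self.open_items[-1][1] = True  # the enclosing <li> is not a leaf
            self.open_items.append([[], False])

    def handle_endtag(self, tag):
        # A stray </li> with no open <li> is ignored.
        if tag == "li" and self.open_items:
            parts, has_nested_li = self.open_items.pop()
            if not has_nested_li:
                text = "".join(parts).replace("\xa0", " ").strip()
                self.leaf_texts.append(text)

    def handle_data(self, data):
        if self.open_items:
            self.open_items[-1][0].append(data)


s = ("<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong>"
     " is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet"
     " lev 3</li></ul></li></ul></li></ul></ul><strong></li>")

collector = LeafLiCollector()
collector.feed(s)
collector.close()
print(collector.leaf_texts)
# ['this is a bullet lev 1', 'this is a bullet lev', 'this is a bullet lev 3']
```

Unlike the regular expression, this drops inner tags such as <strong> and decodes &nbsp;, at the cost of a little more code.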

edited Apr 05 '13 at 16:39

answered Apr 05 '13 at 16:33

femtoRgon


That's awesome!! That did the trick. Thanks a lot. I evaluated the parser idea, but there is a problem with it (I assume you mean an HTML parser). The data I am receiving is not, strictly speaking, correctly formatted for HTML parsing, so I gave up on that idea. Again, thanks so much. I will take any other options though; I understand that what I am doing is not efficient. – Zach Apr 07 '13 at 14:49
