
I have an HTML-like string that I want to extract data from.

s="<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"

I want to extract the content of every <li> element that directly contains data, i.e. elements with something like "this is a bullet lev 1 " between the tags, and not those that contain other <li> elements, as in multilevel lists such as

<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li>

I have written a regular expression for that:

<li>([\w &;/<>]*?)</li>

However, this ends up pulling in the unwanted data as well:

<li>this is a bullet lev 1&nbsp;</li>
<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li>
<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li>

whereas I want it to pull:

<li>this is a bullet lev 1&nbsp;</li>
<li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li>
<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li>

The idea is to skip any match whose extracted content itself contains <li>, and move on to the next one.

From my research, I understand that I probably need a lookahead or a lookbehind. I gave it a couple of tries, but to no avail.

Any clues? I am using Python and its built-in re module.

Jens Erat
Zach

2 Answers


I've never used BeautifulSoup before, but I installed it and, without reading any documentation, had this working within 15 minutes:

>>> s="<ul><li>this is a bullet lev 1&nbsp;</li><li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li><li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>"
>>> from BeautifulSoup import BeautifulSoup
>>> soup = BeautifulSoup(s)
>>> for liRaw in soup.findAll('li'):
...   if liRaw.findParent().findParent().name == u'[document]':
...     print liRaw.text
this is a bullet lev 1&nbsp;
&nbsp;thisis a bullet lev&nbsp;
&nbsp;this is a bullet lev 3

Hope this helps...

Vorsprung
  • Hi there, what I gave as an example above is a correctly formatted example. I evaluated the parser idea, but there is a problem with it: the data I am receiving is not, strictly speaking, correctly formatted for HTML parsing, so I gave up on that idea. However, I understand that what I am doing is not efficient. The parser I tried was the default xml.dom.minidom, not BeautifulSoup. I will give that a try as a plan B. Thanks a lot for your suggestion. – Zach Apr 07 '13 at 14:53
  • It might be worth running a proper test with data that is more realistic. Most good HTML parsers can cope with a similar level of malformedness to what a real browser can. For example, I just tried removing all the ```` tags from your example string ``s`` and it worked fine, with the same results. I'll certainly be bearing BeautifulSoup in mind for future projects. – Vorsprung Apr 07 '13 at 18:46

I think this might do the job.

<li>((?!<li>).)*?</li>

This should match any <li> followed by </li>, along with everything in between, as long as the text in between doesn't contain another <li> (this is enforced with a negative lookahead).

This assumes that you don't actually want <li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li>, but rather: <li>this is a bullet lev 3</li>, in your examples, which seems more consistent with your description.
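As a minimal sketch of how the pattern above could be applied with Python's re module (Python 3 syntax is an assumption here, and the names LEAF_LI and leaf_items are just illustrative): the only change is that the repeated part is made non-capturing and the whole inner content gets its own group, because re.findall would otherwise return only the last character captured by the repeated group.

```python
import re

# The sample string from the question, split across lines for readability.
s = ("<ul><li>this is a bullet lev 1&nbsp;</li>"
     "<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li>"
     "<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li>"
     "</ul></ul><strong></li>")

# (?:(?!<li>).)*? consumes one character at a time, but only while the
# text at the current position does not start another <li>.
# The outer group captures the whole inner content of the match.
LEAF_LI = re.compile(r"<li>((?:(?!<li>).)*?)</li>")

leaf_items = LEAF_LI.findall(s)
for item in leaf_items:
    print(item)
```

Note that "." does not match newlines by default, so if the real data spans several lines, passing re.DOTALL to re.compile is probably needed.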

That said, a parser really would be a better idea for this sort of thing, generally speaking.
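For comparison, here is a rough sketch of the parser route using only the standard library's html.parser, which tolerates stray and unclosed tags. It assumes Python 3; the class name LeafLiCollector and its attributes are made up for this example, and it swaps in a simple stack-based approach rather than anything from a specific library. It keeps a stack of open <li> elements and records the text of those that never saw a nested <li>.

```python
from html.parser import HTMLParser

s = ("<ul><li>this is a bullet lev 1&nbsp;</li>"
     "<li><ul><li><strong>&nbsp;this</strong> is a bullet lev&nbsp;</li></ul></li>"
     "<li>&nbsp;<ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li>"
     "</ul></ul><strong></li>")


class LeafLiCollector(HTMLParser):
    """Collects the text of <li> elements that contain no nested <li>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open = []    # stack of [text_parts, has_nested_li]
        self.leaves = []   # text of each leaf <li>, in document order

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            if self._open:
                # The enclosing <li> is not a leaf any more.
                self._open[-1][1] = True
            self._open.append([[], False])

    def handle_data(self, data):
        # Text belongs to the innermost open <li>, if there is one.
        if self._open:
            self._open[-1][0].append(data)

    def handle_endtag(self, tag):
        # A stray </li> with nothing open is simply ignored.
        if tag == "li" and self._open:
            parts, has_nested = self._open.pop()
            if not has_nested:
                # strip() also removes the non-breaking spaces from &nbsp;
                self.leaves.append("".join(parts).strip())


collector = LeafLiCollector()
collector.feed(s)
collector.close()
leaves = collector.leaves
print(leaves)
```

Unlike the regex, this returns plain text with inline tags such as <strong> dropped and entities decoded, which may or may not be what is wanted.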

femtoRgon
  • That's awesome!! That did the trick. Thanks a lot. I evaluated the parser idea, but there is a problem with it (I assume you mean an HTML parser): the data I am receiving is not, strictly speaking, correctly formatted for HTML parsing, so I gave up on that idea. Again, thanks so much. I will take any other options though; I understand that what I am doing is not efficient. – Zach Apr 07 '13 at 14:49