I have an HTML-like string that I want to extract data from.
s="<ul><li>this is a bullet lev 1 </li><li><ul><li><strong> this</strong> is a bullet lev </li></ul></li><li> <ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li></ul></ul><strong></li>
"
I want to extract every <li> element that holds data directly, i.e. elements with something like "this is a bullet lev 1 " between the tags, and not those that contain other <li> elements, as in multilevel lists such as:
<li><ul><li><strong> this</strong> is a bullet lev </li></ul></li>
I have written a regular expression for that:
<li>([\w &;/<>]*?)</li>
However, this ends up pulling unwanted data as well:
<li>this is a bullet lev 1 </li>
<li><ul><li><strong> this</strong> is a bullet lev </li>
<li> <ul><li><ul><li>this is a bullet lev 3</li>
whereas I want it to pull:
<li>this is a bullet lev 1 </li>
<li><strong> this</strong> is a bullet lev </li>
<li>this is a bullet lev 3</li>
The idea is to skip any match whose extracted content itself contains <li> and move on to the next candidate.
From my research, I understand that I probably need a lookahead or a lookbehind. I gave it a couple of tries, but to no avail.
Any clues? I am using Python and its built-in re module.
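To make the lookahead idea concrete, here is a minimal sketch of what I think it would look like. It keeps my character class but lets each character be consumed only if a new <li> does not start at that position, so a match can never span a nested <li>. The names LEAF_LI and matches are just labels I picked for illustration, and I have only reasoned about this against the sample string above (with the trailing newline dropped), so treat it as an assumption rather than a verified solution.

```python
import re

# Sample string from above, split across lines for readability
# (trailing newline omitted).
s = ("<ul><li>this is a bullet lev 1 </li>"
     "<li><ul><li><strong> this</strong> is a bullet lev </li></ul></li>"
     "<li> <ul><li><ul><li>this is a bullet lev 3</li></ul></li></ul></li>"
     "</ul></ul><strong></li>")

# (?!<li>) is a negative lookahead: it is checked before every character,
# so the lazy repetition stops dead if another "<li>" opens inside the match.
# The regex engine then retries from the next "<li>" further along.
LEAF_LI = re.compile(r"<li>(?:(?!<li>)[\w &;/<>])*?</li>")

# The pattern has no capturing groups, so findall returns the full matches.
matches = LEAF_LI.findall(s)
for m in matches:
    print(m)
```

If only the inner text is needed, wrapping the repeated part in a capturing group, as in <li>((?:(?!<li>)[\w &;/<>])*?)</li>, should make findall return the content without the surrounding tags.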