I have these HTML tags that I pulled from a website:

<ul><li>Some Keys in the UL List</li>
</ul>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft\Rpc</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft\Rpc\UuidTemporaryData</li>
</ul></ul>

<ul><li>Some objects in the UL LIST</li>
</ul>
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>
</ul></ul>

How can I get the lines (the text between the <li> tags) between the <ul> tags? They don't have any class to tell them apart.

I don't know much about BeautifulSoup or regex.

I want a result like this, for example:

<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>
TerryA
Storm
  • [never, ever, ever, parse HTML with a regex](http://stackoverflow.com/a/1732454/1190844) – nc4pk May 20 '13 at 20:53

3 Answers

With BeautifulSoup:

>>> html = textabove  # the HTML from the question, as a string
>>> from bs4 import BeautifulSoup as BS
>>> soup = BS(html, 'html.parser')
>>> for ultag in soup.find_all('ul'):
...     for litag in ultag.find_all('li'):
...         print(litag.text)

Which prints:

Some Keys in the UL List
Some objects in the UL LIST

To get the latter <li> tags:

>>> for litag in soup.find_all('li'):
...     if litag.text.endswith('.mtx'):
...         print(litag)
...         
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>
TerryA
  • @RodrigoMedeiros Check now :) – TerryA May 20 '13 at 21:02
  • Thanks for your help and time, @Haidro. I'll check this tomorrow morning and then post feedback here. – Storm May 20 '13 at 22:38
  • Hello @Haidro, this works perfectly, but the challenge is that sometimes the objects will just have a name and won't end with .mtx, so there is no default to search for, the way I can search for HKEY at the beginning. Many times I'll get text where I need to know how many `<li>` tags are inside the `<ul>`, because they can change on each HTML page and have different names like "123g", "huji", "ospdl", "asuidh354#!%$". ): – Storm May 21 '13 at 11:38
  • Will `<ul><li>Some objects in the UL LIST</li></ul>` always be that exact same text? – TerryA May 21 '13 at 11:41
  • In this case I can also just do this: `var = re.findall(r'<li>(.+).mtx</li>', page)` – Storm May 21 '13 at 11:41
  • Yes, on this page they have titles that I can use to take everything from the first item after the title to the last one. – Storm May 21 '13 at 11:42
  • I'm not on my current computer at the moment, so I'm not able to post an accurate answer. Sorry. But if that regex solution works for you, then by all means use it. – TerryA May 21 '13 at 11:47
  • It will work if I have extensions and some words as defaults. I have been trying for days to take the items from the `<ul><li>title</li><li>(items to take)</li></ul>` – Storm May 21 '13 at 11:51
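
For the case raised in the comments, where the item names have no common suffix but each group follows a title, one possible approach is to group every `<li>` under the title `<li>` that precedes it. The following is only a sketch: it uses the standard library's `html.parser` instead of BeautifulSoup so it runs without extra installs, and it assumes the page always has the shape shown in the question (a title `<li>` inside a `<ul>`, then the items as `<li>` tags outside it). The names `TitledListParser`, `items_by_title` and `SAMPLE` are made up for illustration.

```python
from html.parser import HTMLParser

# Shortened copy of the HTML from the question (raw string because of the backslashes).
SAMPLE = r"""
<ul><li>Some Keys in the UL List</li>
</ul>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft</li>
</ul></ul>

<ul><li>Some objects in the UL LIST</li>
</ul>
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>3$5.mtx</li>
</ul></ul>
"""


class TitledListParser(HTMLParser):
    """Group <li> items found outside a <ul> under the preceding title <li>."""

    def __init__(self):
        super().__init__()
        self.groups = {}           # title text -> list of item texts
        self.ul_depth = 0          # > 0 while inside a <ul>
        self.in_li = False         # True between <li> and </li>
        self.li_is_title = False   # True if the current <li> sits inside a <ul>
        self.current_title = None
        self.buffer = []           # text pieces of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "ul":
            self.ul_depth += 1
        elif tag == "li":
            self.in_li = True
            self.li_is_title = self.ul_depth > 0
            self.buffer = []

    def handle_endtag(self, tag):
        if tag == "ul":
            # The sample contains stray </ul> tags, so never drop below zero.
            self.ul_depth = max(0, self.ul_depth - 1)
        elif tag == "li" and self.in_li:
            text = "".join(self.buffer).strip()
            if self.li_is_title:
                # An <li> inside a <ul> starts a new group.
                self.current_title = text
                self.groups[text] = []
            elif self.current_title is not None:
                # An <li> outside any <ul> belongs to the latest title.
                self.groups[self.current_title].append(text)
            self.in_li = False

    def handle_data(self, data):
        if self.in_li:
            self.buffer.append(data)


def items_by_title(html):
    """Return a dict mapping each title to the list of items that follow it."""
    parser = TitledListParser()
    parser.feed(html)
    parser.close()
    return parser.groups


for title, items in items_by_title(SAMPLE).items():
    print(title, len(items), items)
```

Because the grouping relies only on position relative to the titles, it does not matter whether the items end in .mtx, start with HKEY, or have arbitrary names. If the real pages nest the items differently, the title test (`ul_depth > 0`) would need adjusting.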