How to get multiple lines below tag with regex in python

Question

I have these HTML tags that I pulled from a website:

<ul><li>Some Keys in the UL List</li>
</ul>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft\Rpc</li>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description\Microsoft\Rpc\UuidTemporaryData</li>
</ul></ul>

<ul><li>Some objects in the UL LIST</li>
</ul>
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>
</ul></ul>

How can I get the lines (the text between <li> tags) that sit between the <ul> tags? They don't have any class to tell them apart.

I don't know much about BeautifulSoup or regex.

I want this result, for example:

<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>

[never, ever, ever, parse HTML with a regex](http://stackoverflow.com/a/1732454/1190844) — nc4pk, May 20 '13 at 20:53

TerryA · Answer 1 · 2013-05-20T21:01:16.230

1

With BeautifulSoup:

>>> html = textabove  # the HTML from the question, as a string
>>> from bs4 import BeautifulSoup as BS
>>> soup = BS(html, 'html.parser')  # name the parser explicitly
>>> for ultag in soup.find_all('ul'):
...     for litag in ultag.find_all('li'):
...         print(litag.text)

Which prints:

Some Keys in the UL List
Some objects in the UL LIST

To get the latter <li> tags:

>>> for litag in soup.find_all('li'):
...     if litag.text.endswith('.mtx'):
...         print(litag)
...
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Something.mtx</li>
<li>Default.mtx</li>
<li>3$5.mtx</li>

edited May 20 '13 at 21:01

answered May 20 '13 at 20:51

TerryA

@RodrigoMedeiros Check now :) – TerryA May 20 '13 at 21:02
Thanks for your help and time, @Haidro. I'll check this tomorrow morning and then post feedback here. – Storm May 20 '13 at 22:38
Hello @Haidro, this works perfectly, but ... the challenge is that sometimes the object will just have a name and will not end with .mtx, so I can't search for that by default the way I can search for HKEY at the beginning. Many times I'll get text where I need to know how many tags there are inside the – Storm May 21 '13 at 11:38
Will `Some objects in the UL LIST` always be that exact same text? – TerryA May 21 '13 at 11:41
In this case I can also just do this: var = re.findall(r'(.+).mtx', page) – Storm May 21 '13 at 11:41

score 0 · Answer 2 · answered May 20 '13 at 20:59

You do not need regular expressions to do that; take a look at Python's HTMLParser.
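As a rough illustration of that suggestion, here is a minimal sketch using the standard library's `html.parser.HTMLParser` (Python 3). The class name `ListItemCollector`, the helper `items_after`, and the shortened sample HTML are made up for this example; the helper simply assumes the wanted items are all the `<li>` entries that follow the header text.

```python
from html.parser import HTMLParser

# Shortened copy of the markup from the question (stray </ul> tags included).
html = r"""
<ul><li>Some Keys in the UL List</li>
</ul>
<li>HKEY_LOCAL_MACHINE\SOFTWARE\Description</li>
</ul></ul>

<ul><li>Some objects in the UL LIST</li>
</ul>
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
</ul></ul>
"""


class ListItemCollector(HTMLParser):
    """Collect the text of every <li> element, in document order."""

    def __init__(self):
        super().__init__()
        self.in_li = False   # True while inside an open <li>
        self.items = []      # text of each completed <li>
        self._buffer = []    # text fragments of the current <li>

    def handle_starttag(self, tag, attrs):
        if tag == "li":
            self.in_li = True
            self._buffer = []

    def handle_data(self, data):
        if self.in_li:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        # Other end tags (such as the unmatched </ul>) are ignored.
        if tag == "li" and self.in_li:
            self.items.append("".join(self._buffer).strip())
            self.in_li = False


def items_after(items, header):
    """Return every item after `header`; raises ValueError if it is missing."""
    return items[items.index(header) + 1:]


parser = ListItemCollector()
parser.feed(html)
parser.close()

mutexes = items_after(parser.items, "Some objects in the UL LIST")
print(mutexes)
```

This does not depend on the items ending in `.mtx`, only on knowing the header text. If more sections follow the one you want, the slice would need an end marker as well.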

answered May 20 '13 at 20:59

abugnais

score 0 · Accepted Answer · answered May 21 '13 at 13:49

soup.find(text='Some objects in the UL LIST').findNext('ul').findAll('li')

Thanks @Haidro, you gave me some ideas and things to search for. Thanks for your help and time.
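For anyone wanting to try the one-liner above, here is a self-contained sketch. It assumes the real page is well formed, with the header in its own <ul> and the items in the <ul> that follows; the sample HTML below is a guess at that structure, not the actual page. It uses `string=` (the newer spelling of the `text=` argument) and the snake_case method names from bs4.

```python
from bs4 import BeautifulSoup

# Hypothetical, well-formed version of the page: each header <ul> is
# followed by a second <ul> that holds the items.
html = """
<ul><li>Some Keys in the UL List</li></ul>
<ul>
<li>HKEY_LOCAL_MACHINE</li>
</ul>
<ul><li>Some objects in the UL LIST</li></ul>
<ul>
<li>_SHuassist.mtx</li>
<li>MuteX.mtx</li>
<li>Default</li>
</ul>
"""

soup = BeautifulSoup(html, "html.parser")

# Find the header text, jump to the next <ul> after it, and take its <li> tags.
header = soup.find(string="Some objects in the UL LIST")
items = header.find_next("ul").find_all("li")

names = [li.get_text(strip=True) for li in items]
print(names)
```

Because the lookup is keyed on the header text, it also picks up items such as `Default` that do not end in `.mtx`. If the header is not found, `soup.find` returns None and the chained call fails, so real code would want a check there.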

answered May 21 '13 at 13:49

Storm
