Regular Expressions match a block from multiline html text

Question

I have a few HTML files containing two different patterns of a piece of markup, in which only name="horizon" is constant. I need to get the value of the attribute named "value". Sample files are below:
File1:

<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>

File2:

<othertag some_att="asfa"> data
</othertag>
<select id="realm_17" size="1" name="horizon">
    <option id="option_LoginPage_1" value="Admin Users">Admin Users</option>
    <option id="option_LoginPage_1" value="Global-User">Global-User</option>
</select>

Since the files will contain other tags and attributes, I tried writing a regular expression to filter the required content out of the files:

regex='^(?:.*?)(<(?P<TAG>\w+).+name\=\"horizon\"(?:.*[\n|\r\n?]*)+?<\/(?P=TAG>)'

I have tried this with re.MULTILINE and re.DOTALL but could not get the desired text.
I suppose that, once I have the required text, I could get the required names as a list with re.findall('value\=\"(.*)\"', text).
Please suggest an elegant way to handle this situation.
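
For reference, the two-step idea described above (isolate the select block, then collect its value attributes) could be sketched roughly as follows. The names BLOCK_RE, VALUE_RE and horizon_values are made up for illustration, and the sketch assumes double-quoted attributes and a select element; it is not a general HTML parser.

```python
import re

# Step 1: grab the whole <select ... name="horizon" ...> ... </select> block.
# [^>]* keeps the match inside one opening tag; re.DOTALL lets .*? span lines.
BLOCK_RE = re.compile(
    r'<select\b[^>]*\bname="horizon"[^>]*>(.*?)</select>',
    re.DOTALL | re.IGNORECASE,
)
# Step 2: pull each value="..." out of the captured block.
VALUE_RE = re.compile(r'\bvalue="([^"]*)"')


def horizon_values(html_text):
    """Return the option values of the select named "horizon" (empty list if absent)."""
    block = BLOCK_RE.search(html_text)
    return VALUE_RE.findall(block.group(1)) if block else []


file1 = '''<tag1> data
</tag1>
<select size="1" name="horizon">
    <option value="Admin">Admin Users</option>
    <option value="Remote Admin">Remote Admin</option>
</select>
'''
print(horizon_values(file1))  # ['Admin', 'Remote Admin']
```

This only works while the markup stays close to the samples; attribute order, quoting style, or nested tags can easily break it.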

@ZiTAL I am getting the HTML text from requests.get(url), so I thought a regular expression would be easier and clearer. – Jeeta, Jan 11 '18 at 11:27
Parsing XML/HTML with regex is only the way to go when it is impossible to do it through the DOM for *whatever* reason. If you can use the DOM, use the DOM, because it is easier and more secure. – ZiTAL, Jan 11 '18 at 12:10

score 2 · Answer 1 · answered Jan 11 '18 at 11:49

I completely agree with @ZiTAL that parsing the files as XML would be much faster and nicer.

A few simple lines of code would solve your problem:

import xml.etree.ElementTree as ET

# Parse the file and get its root element.
tree = ET.parse('file.xml')
root = tree.getroot()

# To parse a string directly instead, use:
# root = ET.fromstring('<root>example</root>')

# Collect the "value" attribute of every <option> element in the document.
values = [el.attrib['value'] for el in root.findall('.//option')]

print(values)

kazbeel


Thanks @kazbeel. I tried ElementTree from xml.etree, but it gave a "mismatched tag" error on all my files and text. After some more research I found the BeautifulSoup module and used it, which gave me perfect results. – Jeeta Jan 16 '18 at 02:46
Glad to hear that you found a solution to your problem! Keep learning, keep growing! – kazbeel Jan 18 '18 at 17:41
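
The "mismatched tag" error mentioned in the comment above is what an XML parser typically reports for HTML that is valid but not well-formed XML, such as an input element that is never closed. A small sketch of that behaviour, assuming a made-up snippet and a hypothetical helper try_parse:

```python
import xml.etree.ElementTree as ET

# Hypothetical snippet: acceptable HTML, but <input> is never closed,
# so it is not well-formed XML.
snippet = '<form><input type="hidden" name="horizon" value="RADIUS"></form>'


def try_parse(text):
    """Return the parsed root element, or the error message if parsing fails."""
    try:
        return ET.fromstring(text)
    except ET.ParseError as exc:
        return str(exc)


print(try_parse(snippet))                               # an error message, not an element
print(try_parse('<form><input value="RADIUS"/></form>'))  # an Element, since the input is self-closed
```

This is why an HTML-aware parser tends to be a better fit for pages fetched from the web.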

score 0 · Answer 2 · answered Jan 11 '18 at 11:57

Try this regex:

value="(.*)">

This is a simple regex for extracting the value from your HTML files. It captures everything between the double quotes that follow value= and precede >.

I have also attached a screenshot of the output.
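
One caveat worth checking before relying on this pattern: (.*) is greedy, so it only gives the expected result when value is the last attribute in the tag. The sketch below compares it with a stricter variant; the sample line and the names GREEDY and STRICT are illustrative assumptions, not part of the original files.

```python
import re

GREEDY = re.compile(r'value="(.*)">')    # the pattern from this answer
STRICT = re.compile(r'value="([^"]*)"')  # stops at the first closing quote

# Hypothetical line in which value is NOT the last attribute of the tag.
line = '<input value="RADIUS" name="horizon">'

print(GREEDY.findall(line))  # ['RADIUS" name="horizon']
print(STRICT.findall(line))  # ['RADIUS']
```

Neither pattern restricts the match to the tag with name="horizon", so both still pick up value attributes from unrelated tags.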

Usman


Actually, many tags have a value attribute, but I need the "value" attribute under the tag that has name="horizon", so this was not straightforward. Thanks for the reply, @Muhammad Usman – Jeeta Jan 16 '18 at 03:15

score 0 · Accepted Answer · answered Jan 16 '18 at 03:09

I tried the xml.etree.ElementTree module as explained by @kazbeel, but it gave me a "mismatched tag" error, which I found to be the case in most instances of its usage. I then found the BeautifulSoup module and used it, and it gave the desired results. The following code also covers another file pattern in addition to the ones from the question.
File3:

<input id="realm_90" type="hidden" name="horizon" value="RADIUS">

Code:

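from bs4 import BeautifulSoup  # module for parsing XML/HTML files


def get_realms(html_text):
    """Return the "value" attributes of the element named "horizon"."""
    realms = []
    soup = BeautifulSoup(html_text, 'lxml')  # the 'lxml' parser needs the lxml package
    in_tag = soup.find(attrs={"name": "horizon"})
    if in_tag is None:  # no element with name="horizon" in this document
        return realms
    if in_tag.name == 'select':
        # Only the <option> children carry the values we want.
        for tag in in_tag.find_all('option'):
            realms.append(tag.attrs['value'])
    elif in_tag.name == 'input':
        realms.append(in_tag.attrs['value'])
    return realms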

I agree with @ZiTAL that regular expressions should not be used to parse XML/HTML files: it gets too complicated, and there are a number of libraries available for the job.
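
If installing third-party packages is not an option, roughly the same logic can be sketched with the standard library's html.parser, which also tolerates unclosed tags. This is an alternative to the BeautifulSoup code above, not what I actually used; HorizonParser and get_realms_stdlib are hypothetical names.

```python
from html.parser import HTMLParser


class HorizonParser(HTMLParser):
    """Collect values from the element named "horizon" (a select's options, or an input)."""

    def __init__(self):
        super().__init__()
        self.values = []
        self._in_select = False  # True while inside <select name="horizon">

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("name") == "horizon":
            if tag == "select":
                self._in_select = True
            elif tag == "input" and attrs.get("value") is not None:
                self.values.append(attrs["value"])
        elif tag == "option" and self._in_select and attrs.get("value") is not None:
            self.values.append(attrs["value"])

    def handle_endtag(self, tag):
        if tag == "select":
            self._in_select = False


def get_realms_stdlib(html_text):
    """Feed the text through HorizonParser and return the collected values."""
    parser = HorizonParser()
    parser.feed(html_text)
    parser.close()
    return parser.values


print(get_realms_stdlib('<input id="realm_90" type="hidden" name="horizon" value="RADIUS">'))  # ['RADIUS']
```

Options belonging to other select elements are skipped because values are only collected while the parser is inside the select named "horizon".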
