
I've searched but couldn't find the right answer; maybe my search query was not correct. As for the question, I have the following dropdown markup in an HTML document.

   <select style="background: red; color: #fff; padding: 5px;" class="mainNewcat" size="1">
<option>My New List</option>
<option value="http://www.google.com/value1.html">Value 1</option><option value="http://www.google.com/value2.html">Value 2</option><option value="http://www.google.com/value3.html">Value 3</option> </select>
<select style="background: green; color: #fff; padding: 5px;" class="mainOldcat" size="1">
<option>My Old List</option>
<option value="http://www.yahoo.com/cat1.html">Category 1</option><option value="http://www.yahoo.com/cat2.html">Category 2</option><option value="http://www.yahoo.com/cat3.html">Category 3</option> </select>

What I'm looking for is the URL and text of each option in 'My New List' only. My current regex solution first extracts the block of option elements under 'My New List', then runs a second regex over that result to pull out each URL and text, as below, using Python's re module.

import re
main_regex = re.compile(r'<select.+?\n.+?New.+?\n(.+?)</select>').findall(html)
final_regex = re.compile(r'value="(.+?)">(.+?)</option>').findall(main_regex[0])  # findall() returns a list, so search its first match

Is there a better solution than what I have, or should I use a parser instead of regex?

Malhar

1 Answer


How about you parse the HTML with, well, an HTML parser? Example using BeautifulSoup:

from bs4 import BeautifulSoup

data = """
<select style="background: red; color: #fff; padding: 5px;" class="mainNewcat" size="1">
    <option>My New List</option>
    <option value="http://www.google.com/value1.html">Value 1</option>
    <option value="http://www.google.com/value2.html">Value 2</option>
    <option value="http://www.google.com/value3.html">Value 3</option>
</select>

<select style="background: green; color: #fff; padding: 5px;" class="mainOldcat" size="1">
    <option>My Old List</option>
    <option value="http://www.yahoo.com/cat1.html">Category 1</option>
    <option value="http://www.yahoo.com/cat2.html">Category 2</option>
    <option value="http://www.yahoo.com/cat3.html">Category 3</option>
</select>
"""
soup = BeautifulSoup(data, "html.parser")

for option in soup.select("select.mainNewcat > option[value]"):
    print(option["value"], option.text)  # hiding the important link here: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454

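Prints (under Python 2; Python 3 prints each URL and text separated by a space, without the tuple notation):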

(u'http://www.google.com/value1.html', u'Value 1')
(u'http://www.google.com/value2.html', u'Value 2')
(u'http://www.google.com/value3.html', u'Value 3')

Here we use a CSS selector to match the option elements that have a value attribute and are direct children of the select element with the "mainNewcat" class. The first option, "My New List", has no value attribute, so it is skipped.


Just FYI, we can look at the problem from a different angle - first locate the option with "My New List" text and then look into next option siblings:

my_new_list_option = soup.find("option", text="My New List")
# Search the siblings of the located option, not of the whole soup
for option in my_new_list_option.find_next_siblings("option", value=True):
    print(option["value"], option.text)
alecxe
  • Crap. Yours would be the ideal solution, but I wanted to stay away from BeautifulSoup as it causes performance issues on lower-end devices such as the RPi. Those devices perform much better with regex than with BeautifulSoup. – Malhar Oct 28 '15 at 13:51
  • @user1819085 well, we can work on making it work faster if this is the case. We can start with using `lxml` as an underlying parser: `BeautifulSoup(data, "lxml")` or using soup strainer classes to parse only relevant parts of pages. – alecxe Oct 28 '15 at 13:52
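
The soup strainer idea from the last comment could look roughly like the sketch below. It is only an illustration, not a benchmarked fix: the `html` sample is a shortened copy of the markup from the question, the names `only_new_list` and `pairs` are made up here, and the built-in "html.parser" backend is used so that nothing beyond bs4 is required. Passing "lxml" instead should work if that package is installed, but whether either option is fast enough on an RPi would need measuring.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Shortened copy of the markup from the question (assumed sample data).
html = """
<select class="mainNewcat" size="1">
    <option>My New List</option>
    <option value="http://www.google.com/value1.html">Value 1</option>
    <option value="http://www.google.com/value2.html">Value 2</option>
</select>
<select class="mainOldcat" size="1">
    <option>My Old List</option>
    <option value="http://www.yahoo.com/cat1.html">Category 1</option>
</select>
"""

# Only <select class="mainNewcat"> and its children are turned into
# a tree; the rest of the document is skipped during parsing.
only_new_list = SoupStrainer("select", attrs={"class": "mainNewcat"})

# "html.parser" ships with Python; "lxml" is an optional alternative.
soup = BeautifulSoup(html, "html.parser", parse_only=only_new_list)

# Collect (url, text) for every option that carries a value attribute.
pairs = [(option["value"], option.get_text())
         for option in soup.find_all("option", value=True)]

for url, text in pairs:
    print(url, text)
```

Because the strainer discards everything outside the matching select, the later `find_all` call no longer needs to mention the class at all; the "My Old List" options never make it into the tree.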