0

Say I have many options in an HTML page (opened as a text file), as below:

<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>     
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>     
</select>

Now, I've searched for <select id=

with open('/u/poolla/Downloads/creat/xyz.txt') as f:
    for line in f:
        line = line.strip()
        if '<select id=' in line:
            print("true")

Now, whenever <select id= occurs, I want to get the id value. That is, I want to copy the string that starts after the " following id= and ends at the next ".

How do I do this in Python?

  • 7
    Please! `BeautifulSoup`: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – sshashank124 Apr 09 '14 at 12:34
  • 1
    Or lxml, if you want a less awful parser. :P – Wooble Apr 09 '14 at 12:36
  • `re.findall('id=".*?"', line)[0][4:-1]` yw... – Torxed Apr 09 '14 at 12:54
  • 1
    @Wooble: You do know that BeautifulSoup uses pluggable parsers and that `lxml`, if installed, is the default, right? BeautifulSoup 4 is *not about parsing* (anymore) but about the object model. Which is pretty neat for most HTML tasks, really. – Martijn Pieters Apr 09 '14 at 12:57
  • @Wooble: Use `lxml` if you want to use the ElementTree-on-steroids object model instead. Don't pick it because you think the parser might be better... – Martijn Pieters Apr 09 '14 at 12:58
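For reference, the regex one-liner from the comments can be spelled out with a capturing group so that only the value between the quotes is returned. This is a minimal sketch, not a robust solution: the names `SELECT_ID` and `select_ids` are hypothetical, and the pattern assumes each `<select>` start tag sits on one line with a double-quoted `id`. As the other comments point out, a real HTML parser is the safer choice.

```python
import re

# Assumption: the <select ...> start tag is on a single line and the id
# value is wrapped in double quotes. The group captures only the value.
SELECT_ID = re.compile(r'<select\b[^>]*?\sid="([^"]*)"')


def select_ids(lines):
    """Return the id values of all <select> tags found in the given lines."""
    ids = []
    for line in lines:
        ids.extend(SELECT_ID.findall(line))
    return ids


sample = [
    '<select id="my">',
    '  <option id="1">1a</option>',
    '</select>',
    '<select id="my1">',
]
print(select_ids(sample))
```

Because the pattern is anchored on `<select`, the `id` attributes of the `<option>` tags are skipped.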

2 Answers

3

An HTML parser library is usually better at parsing HTML than raw string functions or regular expressions. Here's an example with the standard library's HTMLParser class:

html = """
<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>
</select>
"""

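try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []  # id values collected from <select> tags

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag
        if tag == 'select':
            self.ids.extend(val for name, val in attrs if name == 'id')


p = MyParser()
p.feed(html)
print(p.ids)  # ['my', 'my1']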
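Since the question reads the page from a text file rather than an inline string, the same idea could be wired up roughly as follows. This is only a sketch: `SelectIdParser` and `select_ids_from_file` are hypothetical names, and a throwaway temporary file stands in for the real path from the question.

```python
import os
import tempfile

try:
    from HTMLParser import HTMLParser   # Python 2
except ImportError:
    from html.parser import HTMLParser  # Python 3


class SelectIdParser(HTMLParser):
    """Collects the id attribute of every <select> start tag."""

    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'select':
            self.ids.extend(val for name, val in attrs if name == 'id')


def select_ids_from_file(path):
    """Feed the whole file to the parser and return the collected ids."""
    parser = SelectIdParser()
    with open(path) as f:
        parser.feed(f.read())
    parser.close()
    return parser.ids


# Demo: a temporary file stands in for the real HTML text file.
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('<select id="my">\n<option id="1">1a</option>\n</select>\n'
              '<select id="my1"></select>\n')
try:
    print(select_ids_from_file(tmp.name))
finally:
    os.remove(tmp.name)
```

Feeding the whole file at once, instead of testing line by line, means a tag that is split across lines is still handled by the parser.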
gog
0

BeautifulSoup 4 has a very useful select method that lets you query an HTML document with CSS selectors.

Something like the following code (not tested, sorry :-) ) should get the ids of all the select tags in an HTML document.

from bs4 import BeautifulSoup

# html is the document text, e.g. the contents read from the file
soup = BeautifulSoup(html, "html.parser")
tags = soup.select("select")
print([t.get("id") for t in tags])
luc