Selecting a string in Python

Question

Say I have an HTML page (opened as a text file) with many options, as below:

<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>     
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>     
</select>

So far, I search each line for <select id= like this:

with open('/u/poolla/Downloads/creat/xyz.txt') as f:
    for line in f:
        line = line.strip()
        if '<select id=' in line:
            print "true"

Now, whenever <select id= occurs, I want to get the id value, that is, copy the string that starts after the " following id= and ends at the next ".

How do I do this in Python?

Please! `BeautifulSoup`: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags — sshashank124, Apr 09 '14 at 12:34
@Wooble: You do know that BeautifulSoup uses pluggable parsers and that `lxml`, if installed, is the default, right? BeautifulSoup 4 is *not about parsing* (anymore) but about the object model. Which is pretty neat for most HTML tasks, really. — Martijn Pieters, Apr 09 '14 at 12:57
@Wooble: Use `lxml` if you want to use the ElementTree-on-steroids object model instead. Don't pick it because you think the parser might be better... — Martijn Pieters, Apr 09 '14 at 12:58

score 3 · Accepted Answer · answered Apr 09 '14 at 13:01

An HTML parser library is usually better at parsing HTML than raw string functions or regular expressions. Here's an example with the standard library's HTMLParser class:

html = """
<select id="my">
  <option id="1">1a</option>
  <option id="2">2bb</option>
</select>

<select id="my1">
  <option id="11">11a</option>
  <option id="21">21bb</option>
</select>
"""

from HTMLParser import HTMLParser

class MyParser(HTMLParser):
    def __init__(self):
        HTMLParser.__init__(self)
        self.ids = []

    def handle_starttag(self, tag, attrs):
        if tag == 'select':
            self.ids.extend(val for name, val in attrs if name == 'id')


p = MyParser()
p.feed(html)
print p.ids  # ['my', 'my1']
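
The snippet above uses Python 2 names (the HTMLParser module and the print statement). In Python 3 the module is html.parser, so a roughly equivalent sketch could look like the following. The class name SelectIdParser, the helper select_ids, and the sample string are made up for illustration and are not part of the original answer.

```python
from html.parser import HTMLParser  # Python 3 name of the Python 2 HTMLParser module


class SelectIdParser(HTMLParser):
    """Collect the id attribute of every <select> start tag."""

    def __init__(self):
        super().__init__()
        self.ids = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag and attribute
        # names are passed in lowercased by HTMLParser
        if tag == "select":
            self.ids.extend(val for name, val in attrs if name == "id")


def select_ids(markup):
    """Hypothetical helper: return the ids of all <select> tags in markup."""
    parser = SelectIdParser()
    parser.feed(markup)
    parser.close()
    return parser.ids


# Shortened sample input, assumed to resemble the page in the question
sample = """
<select id="my">
  <option id="1">1a</option>
</select>
<select id="my1">
  <option id="11">11a</option>
</select>
"""

print(select_ids(sample))  # ['my', 'my1']
```

To run this against the file from the question, it should be enough to open the file and pass f.read() to select_ids instead of the sample string. Because the parser sees whole tags rather than lines, the id is found even when a tag spans several lines.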

score 0 · Answer 2 · answered Apr 09 '14 at 13:20

BeautifulSoup4 has a very useful select method, which makes it possible to query an HTML document with CSS selectors.

Something like the following code (not tested, sorry :-) ) should get all the ids of the select tags in an HTML document.

from bs4 import BeautifulSoup

# html is the markup string, e.g. the contents of the file
soup = BeautifulSoup(html, "html.parser")
tags = soup.select("select")
print [t.get("id") for t in tags]
