text = u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'

I am new to Python. I want to extract \ue6ec, \ue6f6, \ue6ec from the string above. How can I fetch these strings using the re module? Thank you very much!

4 Answers

Regular expressions are not a good tool for working with HTML. Use Beautiful Soup instead.

ceth
>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']
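
The session above is Python 2 with BeautifulSoup 3. A rough Python 3 equivalent, assuming the third-party `beautifulsoup4` package is installed (the `glyphs` name is just for illustration), might look like this:

```python
from bs4 import BeautifulSoup  # assumption: installed via `pip install beautifulsoup4`

text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# Use the parser bundled with Python so no extra parser library is needed.
soup = BeautifulSoup(text, "html.parser")

# find_all(string=True) returns every text node in document order;
# str() converts each NavigableString into a plain string.
glyphs = [str(node) for node in soup.find_all(string=True)]

# Join them if a single string is wanted instead of a list.
joined = "".join(glyphs)
print(glyphs, joined)
```

The empty `<a href="#5">` link contributes no text node, so only the three `<font>` contents are collected.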
Kimvais
  • And for reference, that produces `u'\ue689\ue6ec\ue6f6'`. – Chris Morgan Nov 26 '10 at 07:14
  • The latest BeautifulSoup-3.0.0.py does not have a getText() method. How should I use it? Thank you. – user521023 Nov 26 '10 at 07:26
  • Oops, did not notice. Fixed now (and this is actually better, since now you don't have to split it). If you want them in a single string, do `''.join(t.findAll(text=True))` – Kimvais Nov 26 '10 at 07:34

Don't use regular expressions to parse HTML. Use BeautifulSoup; see the BeautifulSoup documentation.
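
For comparison, since the question asked specifically about `re`: on the exact snippet from the question, where every target sits directly inside a `<font>` tag, a pattern like the one below does pull out the characters. This is only a sketch under that assumption (the `FONT_TEXT` and `glyphs` names are illustrative), and it stops working as soon as the markup nests further tags inside `<font>` or changes shape, which is why a parser is the safer choice.

```python
import re

text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# Capture the run of non-"<" characters between an opening <font ...> tag
# and its closing </font>. This assumes no child tags inside <font>.
FONT_TEXT = re.compile(r'<font\b[^>]*>([^<]*)</font>')

# findall returns the captured group for each match, in order.
glyphs = FONT_TEXT.findall(text)
print(glyphs)
```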

user225312

If you know that the page will always have that format, use the BeautifulSoup parser to find what you need in the HTML.

However, BeautifulSoup can sometimes break on malformed HTML. I'd suggest using lxml, which is a Python binding for libxml2. It will parse and usually correct malformed HTML.
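
A minimal sketch of the lxml route, assuming the third-party `lxml` package is installed and using the snippet from the question (the `doc` and `glyphs` names are illustrative):

```python
import lxml.html  # assumption: installed via `pip install lxml`

text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# fromstring() accepts an HTML fragment; when the fragment has several
# top-level elements, lxml wraps them in a single parent element.
doc = lxml.html.fromstring(text)

# XPath: the text nodes that are direct children of any <font> descendant.
glyphs = [str(t) for t in doc.xpath('.//font/text()')]
print(glyphs)
```

Selecting by tag with XPath, rather than taking every text node, keeps the result limited to the `<font>` contents even if other text appears elsewhere in the page.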

Utku Zihnioglu