Python: how to extract these strings

Question

text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'

I am new to Python. I want to get \ue6ec, \ue6f6, \ue6ec. How can I extract these strings using the re module? Thank you very much!

wow, this fragment looks intentionally obfuscated. What does this actually come from? — SingleNegationElimination, Nov 26 '10 at 07:43

score 2 · Answer 1 · answered Nov 26 '10 at 07:09

Regular expressions are not a good tool for working with HTML. Use Beautiful Soup instead.
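To illustrate what parsing (rather than pattern matching) looks like, here is a minimal sketch that uses the standard library's `html.parser` as a stand-in for Beautiful Soup. It assumes Python 3, and the class name `TextCollector` is made up for this example; it is not part of any library.

```python
from html.parser import HTMLParser

# The markup from the question, written as a Python 3 string.
text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)


class TextCollector(HTMLParser):
    """Hypothetical helper: collects every text node the parser encounters."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for text between tags; tags and attributes are skipped.
        self.chunks.append(data)


collector = TextCollector()
collector.feed(text)
collector.close()

chars = collector.chunks       # one entry per text node
joined = "".join(chars)        # all text nodes as a single string
print(chars)
```

The parser tracks tags for you, so nothing here depends on the exact attribute order or quoting that a regular expression would have to match.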

answered Nov 26 '10 at 07:09

ceth

Kimvais · Answer 2 · 2010-11-26T07:32:43.643

2

>>> from BeautifulSoup import BeautifulSoup
>>> text=u'<a href="#5" accesskey="5"></a><a href="#1" accesskey="1"><font color="#667755">\ue689</font></a><a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a><a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
>>> t = BeautifulSoup(text)
>>> t.findAll(text=True)
[u'\ue689', u'\ue6ec', u'\ue6f6']
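The session above uses BeautifulSoup 3 on Python 2. A rough equivalent for Python 3, assuming the `beautifulsoup4` package is installed (imported as `bs4`) and that the built-in `"html.parser"` backend is acceptable, might look like this:

```python
from bs4 import BeautifulSoup

# The markup from the question, written as a Python 3 string.
text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(text, "html.parser")

# find_all(string=True) returns every text node in document order;
# it is the bs4 spelling of findAll(text=True) used above.
chars = [str(node) for node in soup.find_all(string=True)]

# Join the pieces if a single string is wanted.
joined = "".join(chars)
print(chars)
```

The empty first link contributes no text node, so only the three characters inside the `font` tags are returned.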

edited Nov 26 '10 at 07:32

answered Nov 26 '10 at 07:11

Kimvais

And for reference, that produces `u'\ue689\ue6ec\ue6f6'`. – Chris Morgan Nov 26 '10 at 07:14
The latest BeautifulSoup-3.0.0.py does not have a getText() method. How should I use it? Thank you. – user521023 Nov 26 '10 at 07:26
Oops, did not notice - fixed now (and this is actually better, since now you don't have to split it - if you want them in a single string, do `''.join(t.findAll(text=True))`) – Kimvais Nov 26 '10 at 07:34

score 1 · Answer 3 · edited May 23 '17 at 12:29

Don't use regular expressions to parse HTML. Use BeautifulSoup; see the BeautifulSoup documentation.

edited May 23 '17 at 12:29

Community

answered Nov 26 '10 at 07:11

user225312

score 0 · Answer 4 · answered Nov 26 '10 at 14:52

If you know that the page will always have that format, use the BeautifulSoup parser to find what you need in the HTML.

However, BeautifulSoup can sometimes break on malformed HTML. I'd suggest using lxml, which is a Python binding for libxml2. It will parse malformed HTML and usually correct it.
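A minimal sketch of the lxml route, assuming Python 3 with the `lxml` package installed. The XPath expression is an assumption based on the markup in the question, where the wanted characters sit inside `font` elements:

```python
import lxml.html

# The markup from the question, written as a Python 3 string.
text = (
    '<a href="#5" accesskey="5"></a>'
    '<a href="#1" accesskey="1"><font color="#667755">\ue689</font></a>'
    '<a href="#2" accesskey="2"><font color="#667755">\ue6ec</font></a>'
    '<a href="#3" accesskey="3"><font color="#667755">\ue6f6</font></a>'
)

# fromstring() accepts a fragment; when there are several top-level
# elements, lxml wraps them in a single parent element.
root = lxml.html.fromstring(text)

# Select the text nodes directly inside every <font> element.
chars = [str(node) for node in root.xpath(".//font/text()")]

joined = "".join(chars)
print(chars)
```

If the characters are not always wrapped in `font`, `root.itertext()` is an alternative that walks every text node regardless of the enclosing tag.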
