Regular Expression non-ASCII character

Question

I'm having a bit of trouble with a regular expression in Python. The HTML string is:

html = """<td style="padding-right:5px;">
<span class="blackText">Above £ 7.00 = </span>
</td>
<td>
<span class="blackText">
<p>Free</p>
</span>
</td>"""

I want to extract "7.00" and "Free", but the following does not work:

amount = re.findall(r'Above £ (.*?) =',html)

Python raises a non-ASCII character error for the £ symbol. How would I get around this? Thanks.

http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags – unddoch, Nov 29 '12 at 19:48

score 5 · Accepted Answer · answered Nov 29 '12 at 18:50

amount = re.findall(r'Above \xC2\xA3 (.*?) =', html)
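For reference, here is a minimal sketch of the same idea on Python 3, assuming the HTML from the question. In UTF-8 the pound sign is the two-byte sequence C2 A3, so a byte pattern needs both bytes. On decoded text the escape `\u00a3` keeps the source file pure ASCII. The variable names `amount_bytes` and `label` are made up for illustration.

```python
import re

# The HTML from the question; the pound sign is written as an escape
# so this source file contains only ASCII characters.
html = (
    '<td style="padding-right:5px;">\n'
    '<span class="blackText">Above \u00a3 7.00 = </span>\n'
    '</td>\n'
    '<td>\n'
    '<span class="blackText">\n'
    '<p>Free</p>\n'
    '</span>\n'
    '</td>'
)

# Option 1: match on text (str), using the code point U+00A3.
amount = re.findall('Above \u00a3 (.*?) =', html)

# Option 2: match on raw UTF-8 bytes, where the pound sign is C2 A3.
amount_bytes = re.findall(rb'Above \xC2\xA3 (.*?) =', html.encode('utf-8'))

# The second value asked for in the question.
label = re.findall(r'<p>(.*?)</p>', html)

print(amount, amount_bytes, label)
```

On Python 2, a literal £ in the source file also works once the file declares its encoding on the first or second line, for example `# -*- coding: utf-8 -*-`. As the comment on the question suggests, an HTML parser is usually more robust than a regex for markup like this.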

Ωmega

How did you get `\xC2`? My Python seems to be using `\xa3` for the sterling symbol. – chrisaycock Nov 29 '12 at 18:52
@chrisaycock - depends on encoding. `\xa3` is html entity. `\xC2` is utf-8. see (http://www.fileformat.info/info/unicode/char/a3/index.htm) – Jay Walker Nov 29 '12 at 18:54
@JayWalker - OP has experienced error regarding non-ASCII character, so it will be utf-8 – Ωmega Nov 29 '12 at 18:59
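To make the encoding discussion in these comments concrete, here is a small Python 3 sketch of how the pound sign is represented. As far as I can tell, `\xa3` is the Unicode code point U+00A3 (and the single Latin-1 byte), while UTF-8 encodes it as the two bytes `\xc2\xa3`. The variable names are illustrative only.

```python
pound = '\u00a3'  # POUND SIGN, code point U+00A3

code_point = ord(pound)                 # integer value of the code point
utf8_bytes = pound.encode('utf-8')      # two bytes: C2 A3
latin1_bytes = pound.encode('latin-1')  # one byte: A3

print(hex(code_point), utf8_bytes, latin1_bytes)
```

Which escape works in a regex therefore depends on whether the HTML is held as decoded text or as bytes, and in the latter case on the byte encoding.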
