Python regex lookahead non-ASCII character

Question

I have most of this regex down, but I'm having trouble with a lookahead. I want to separate a string into a postcode followed by two values, each of which is either text or a number. The numbers can be of the form:

The text for the middle bit can be "No minimum" and the text for the third bit can only be "Free".

E.g.

"YO1Â£ 10Free" ==> YO1; 10; Free

or

"yo1Â£ 8Â£ 0.5" ==> yo1; 8; 0.5

or

"yo1No minimumÂ£ 0.75" ==> yo1; No minimum; 0.75

I have the first bit done with this:

string = "YO1Â£ 10Free"
patternPostCode = re.compile("[a-zA-Z]{1,2}[0-9][a-zA-Z0-9]?")
postCode = re.findall(patternPostCode, string)

The figures in the string are found by:

patternCost = re.compile(r"(?<=\xa3 )([0-9]|[0-9][0-9]|[0-9]?[0-9]?.[0-9]|[0-9]?[0-9]?.[0-9][0-9])")

I have difficulty adding the 'or text equals "No minimum"' to the patternCost search. I also can't manage to include the lookahead Â. Adding this at the end doesn't work:

(?<=\xc2)

Any help would be appreciated.

score 1 · Accepted Answer · edited May 23 '17 at 10:24

I came up with this on Python 2.7:

# -*- coding: utf-8 -*-
import re

raw_string = "YO1Â£ 10.01Free"
string = raw_string.decode('utf-8')
patternPostCode = re.compile(u"^(\w{3}.*)\s+(\d+\.?\d*)(\w+)$",flags=re.UNICODE)
postCode = patternPostCode.findall(string)

print postCode
print u'; '.join(postCode[0])

This returns:

[(u'YO1\xc2\xa3', u'10.01', u'Free')]
YO1Â£; 10.01; Free

First, the raw string I copied from SO appeared to be a bytestring, so I had to decode it to unicode (see "byte string vs. unicode string. Python"). I think you may be having unicode encoding errors in general: the Â symbol is a classic telltale of that.
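
To make the encoding point concrete, here is a minimal sketch of how a pound sign can turn into Â£. It assumes Python 3 (the code above is Python 2.7), and the variable names are only illustrative.

```python
# Sketch (Python 3 assumed): how a pound sign becomes "Â£" when its
# UTF-8 bytes are decoded with the wrong codec.
pound = "\u00a3"                          # the pound sign
utf8_bytes = pound.encode("utf-8")        # two bytes: 0xC2 0xA3
mojibake = utf8_bytes.decode("latin-1")   # each byte read as its own character

print(repr(utf8_bytes))
print(ascii(mojibake))

# Undoing the wrong decode step recovers the original character.
repaired = mojibake.encode("latin-1").decode("utf-8")
print(repaired == pound)
```

The stray Â is the first of the two UTF-8 bytes shown as a character of its own, which is why it always appears directly in front of the £.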

I then made your regex unicode-friendly, with the re.UNICODE flag. This means you can use \w to mean "alphanumeric" and \d to mean "digits" in a unicode-friendly way.

http://docs.python.org/2/library/re.html#module-re
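
As a rough illustration of what "unicode-friendly" means for \w and \d, here is a small sketch. It assumes Python 3, where str patterns are Unicode-aware by default and re.ASCII switches that off; in Python 2.7 you need re.UNICODE as above. The sample strings are made up.

```python
import re

word = "caf\u00e9"               # "cafe" with an accented final letter
digits = "\u0661\u0662\u0663"    # Arabic-Indic digits 1, 2, 3

# Default (Unicode-aware) matching in Python 3.
unicode_words = re.findall(r"\w+", word)
unicode_digits = re.findall(r"\d+", digits)

# ASCII-only matching: \w and \d are limited to [a-zA-Z0-9_] and [0-9].
ascii_words = re.findall(r"\w+", word, flags=re.ASCII)
ascii_digits = re.findall(r"\d+", digits, flags=re.ASCII)

print(ascii(unicode_words), ascii(ascii_words))
print(ascii(unicode_digits), ascii(ascii_digits))
```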

Since regexes are often mistaken for line noise, lemme unpack for you:

u"^(\w{3}.*)\s+(\d+\.?\d*)(\w+)$"

^ = start of line
(\w{3}.*) = match exactly three alphanumeric chars (\w{3}), followed by anything (.*) and grouped (that's the parentheses around the whole thing). I don't like the .* in general, but it was necessary to grab the Â£ junk. If you don't want it, move it outside the parentheses.
\s+ - at least one whitespace character. We'll throw this away.
(\d+\.?\d*) - match one or more digits, followed by an optional period, followed by zero or more digits. This'll match 10, 10., 10.0, 10.0000 and so on.
(\w+) - one or more alphanumeric chars
$ - match end of line
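
The breakdown above can be tried out with a short sketch like this one. It assumes Python 3 and a properly decoded input containing a real pound sign; it also moves the .* outside the first group, as suggested above, so the currency symbol is dropped. The names text, pattern, number and accepted are only for illustration.

```python
import re

text = "YO1\u00a3 10.01Free"   # decoded input: postcode, pound sign, amount, word

# Same pattern as above, with .* moved outside the first group.
pattern = re.compile(r"^(\w{3}).*\s+(\d+\.?\d*)(\w+)$")
parts = pattern.findall(text)
print(parts)

# The number sub-pattern on its own, checked against the forms listed above.
number = re.compile(r"\d+\.?\d*")
candidates = ["10", "10.", "10.0", "10.0000"]
accepted = [s for s in candidates if number.fullmatch(s)]
print(accepted)
```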

It's certainly not the prettiest regex I've ever written, but hopefully it's enough to get you started.
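
If you want to go further and cover the "No minimum" and "Free" cases from the question, one possible sketch is below. It assumes Python 3 and input that has already been decoded so the currency symbol is a real pound sign. The names POUND, AMOUNT, RECORD and split_record are made up for illustration, and the postcode part is taken from the question's own pattern.

```python
import re

POUND = "\u00a3"

# An amount: pound sign, optional whitespace, digits with an optional decimal part.
AMOUNT = POUND + r"\s*([0-9]+(?:\.[0-9]+)?)"

RECORD = re.compile(
    r"^([A-Za-z]{1,2}[0-9][A-Za-z0-9]?)"       # postcode prefix (from the question)
    + r"(?:(No minimum)|" + AMOUNT + r")"      # middle: fixed text or an amount
    + r"(?:(Free)|" + AMOUNT + r")$"           # last: fixed text or an amount
)

def split_record(text):
    """Return (postcode, minimum, charge) or None if the text does not match."""
    match = RECORD.match(text)
    if match is None:
        return None
    postcode, no_minimum, minimum, free, charge = match.groups()
    # In each pair only one alternative is captured; the other is None.
    return (postcode, no_minimum or minimum, free or charge)

for sample in ["YO1\u00a3 10Free", "yo1\u00a3 8\u00a3 0.5", "yo1No minimum\u00a3 0.75"]:
    print(split_record(sample))
```

Spelling out the two fixed strings as alternatives avoids relying on .* and keeps the pound sign out of the captured values.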

Thanks Rachel, that does help a lot. Regex are quite intimidating so thanks for explaining it! Cheers for the unicode tip, that has been wrecking my head! — eamon1234, Dec 04 '12 at 21:44
Woo! I'm glad it helped you. Unicode is a pain in the ass in Python 2, this Pycon video helped me start to grok it: http://www.youtube.com/watch?v=sgHbC6udIqc — Rachel Sanders, Dec 05 '12 at 18:53
