
I am using Python's `re` module to find all prices in a string. So far, my only trouble is handling the currency symbols correctly. This code, with the input 'happy$37.54000happy$34$3454$3333€27.80€3.00.33.2£27.000':

   import sys
   import re
   price = sys.argv[1]
   new = re.findall(r'[\$\20AC\00A3]{1}\d+\.?\d{0,2}',price,re.UNICODE)
   for prices in new:
       print prices

outputs:

$37.54
$34
$3454    
$3333

What I would like is:

$37.54
$34
$3454
$3333
€27.80
€3.00    
£27.00

If I put the euro sign directly into the code, the file will not run because the source then contains a non-ASCII character. I thought that since 20AC is the Unicode code point for the euro symbol and 00A3 is the code point for the pound symbol, writing \20AC and \00A3 would work, but it does not.

I believe that the issue lies in this part of the code:

[\$\20AC\00A3]...

Any help would be greatly appreciated

EDIT FOR FUTURE PEOPLE - THIS IS THE BEST CODE ANSWER:

    # -*- coding: utf-8 -*-
    import sys
    import re
    price = sys.argv[1]
    new = re.findall(r'[$€£]{1}\d+\.?\d{0,2}',price,re.UNICODE)
    for prices in new:
        print prices

Rorschach
  • Somebody with more python-fu than me [can probably tell you](http://stackoverflow.com/q/1832893/505649) if something exposes the [Unicode character category Sc (Symbols/currency)](http://www.fileformat.info/info/unicode/category/Sc/index.htm) as a character class. – Ulrich Schwarz Jul 09 '15 at 17:00
  • is it safe to assume that a period will always be followed by two digits? – Jason L. Jul 09 '15 at 17:00
  • What if you change `r'` to `ur'`? – kirbyfan64sos Jul 09 '15 at 17:01
  • Thanks for the suggestion kirby, sadly it did not work. – Rorschach Jul 09 '15 at 17:04
  • Thanks Jason, that's a good point, I have changed it to [\$\20AC\00A3]{1}\d+\.?\d{2}? (this does not solve the problem but I think is better code in the long run) – Rorschach Jul 09 '15 at 17:06
  • I have some small suggestions to your updated code. You don't want `|` inside your `[ ]` brackets bc the square brackets already denote "any of these characters". I don't think you need to escape the dollar sign inside the square brackets either bc it's understood to be a character in this context, but that may depend on implementation. But my real suggestion is to group the period with the numbers behind it. Your current version, given `$5.happy` would return `$5.`. But if you group the period together and do something like `(\.\d{1,2})?` then it would return `$5` for that example. – Jason L. Jul 09 '15 at 18:36
  • I have made the first suggestion, but the second one (which uses grouping) results in the output looking like: .54 _ _ _ .80 .00 .00 .00 _ instead of $37.54 $34 $3454 $3333 €27.80 €3.00 £27.00 $2.00 $5. This kind of output seems to happen whenever I use groups in regex on python and I still don't know why. – Rorschach Jul 09 '15 at 20:12
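
The stray `.54 .80 .00` output described in the last comment is documented `re.findall` behaviour: when the pattern contains a capturing group, `findall` returns the text of the group rather than the whole match. Below is a minimal sketch of the difference. It is written for Python 3 (the thread itself uses Python 2), the sample string is a shortened version of the question's input, and the variable names are only illustrative.

```python
import re

# Shortened version of the input string from the question.
text = 'happy$37.54000happy$34$3454€27.80£27.000'

# Capturing group: findall returns only what the group matched,
# and an empty string where the optional group did not take part.
captured = re.findall(r'[$€£]\d+(\.\d{1,2})?', text)

# Non-capturing group (?:...): findall returns the whole match.
whole = re.findall(r'[$€£]\d+(?:\.\d{1,2})?', text)

print(captured)  # ['.54', '', '', '.80', '.00']
print(whole)     # ['$37.54', '$34', '$3454', '€27.80', '£27.00']
```

Switching `(` to `(?:` is usually enough when the group is only there to make the decimal part optional.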

3 Answers


Here is a regex that matches your examples.

[$€£]\d+(\.\d{2})?

Note that this assumes a period is always followed by exactly two digits. So it would match $3.50 in full, but for $3.5 it would stop at $3 and leave the .5 out. If that behavior isn't desired, adjust the regex to

[$€£]\d+(\.\d{1,2})?

which would pick up 3.5 in my example.
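
To see the two variants side by side, here is a small sketch in Python 3 (an assumption, since the question uses Python 2). The sample text is made up for illustration, and the group is written as non-capturing, `(?:...)`, so that `re.findall` returns the whole price rather than just the decimal part.

```python
import re

# Hypothetical sample text, not taken from the question.
text = 'a $3.50 b $3.5 c €12'

# Exactly two decimals, or no decimal part at all.
strict = re.findall(r'[$€£]\d+(?:\.\d{2})?', text)

# One or two decimals, or no decimal part at all.
loose = re.findall(r'[$€£]\d+(?:\.\d{1,2})?', text)

print(strict)  # ['$3.50', '$3', '€12']
print(loose)   # ['$3.50', '$3.5', '€12']
```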

Jason L.
  • I have tried something similar, but I run into this error: SyntaxError: Non-ASCII character '\xe2' .... Have you imported something that knows how to handle the €£ characters? Or did you use some flag beyond my re.UNICODE flag? – Rorschach Jul 09 '15 at 17:08
  • I don't have any experience with character encoding issues in python but maybe this answer will help? http://stackoverflow.com/a/24221963/3442448 – Jason L. Jul 09 '15 at 17:19
  • That works! Thank you. (adding this at the top of the file: # -*- coding: utf-8 -*- ) – Rorschach Jul 09 '15 at 17:22
  • Nice! Never underestimate the power of googling error messages. :) – Jason L. Jul 09 '15 at 17:31

You need to prefix the Unicode code points in your regex with \u, i.e.

new = re.findall(ur'[\$\u20AC\u00A3]{1}\d+\.?\d{0,2}',string,re.UNICODE)

https://docs.python.org/2/tutorial/introduction.html#unicode-strings
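
For readers on Python 3, where the `ur''` prefix no longer exists, here is a hedged sketch of the same idea. It relies on the `re` module resolving `\uXXXX` escapes in the pattern itself (Python 3.3 and later). It also shows why the original `\20AC` did not work: inside a character class `\20` is read as an octal escape, so the class ends up containing the letters `A` and `C` rather than the euro sign. The sample text is invented for illustration.

```python
import re

# Hypothetical sample text, not taken from the question.
text = 'A12 $5 €27.80 £3'

# \20 and \00 are parsed as octal escapes, so this class contains
# '$', chr(16), 'A', 'C', chr(0) and '3', not the euro or pound signs.
broken = re.findall(r'[\$\20AC\00A3]\d+', text)

# \uXXXX names the code point directly; re resolves it in the pattern.
fixed = re.findall(r'[$\u20AC\u00A3]\d+(?:\.\d{1,2})?', text)

print(broken)  # ['A12', '$5']
print(fixed)   # ['$5', '€27.80', '£3']
```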

CentAu
  • Thank you, this answers the symbol question. It is not perfect code, as it outputs: $37.54 000 $34 $3454 $3333 27.80 00.33 27.00, but it answers the question. Thank you – Rorschach Jul 09 '15 at 17:13
  • I fixed it. Now it matches the characters as well. You need to specify your regex in unicode format (with ``u`` prefix). – CentAu Jul 09 '15 at 17:27

I can match with the symbols directly

[$€£]{1}\d+\.?\d{0,2}

http://pythex.org

Jason K
  • Oddly, my command line outputs: SyntaxError: Non-ASCII character '\xe2' . But thank you for the feedback, this website will be helpful in the future. (I am sorry that I cannot upvote yet on this site) – Rorschach Jul 09 '15 at 17:16