
I'd like to write a regular expression for the following type of strings in Python:

1 100

1 567 865

1 474 388 346

i.e. numbers whose thousands groups are separated by spaces. Here's my regexp:

r"(\d{1,3}(?:\s*\d{3})*)"

and it works fine. However, I also want to parse

1 100,34848

1 100 300,8

19 328 383 334,23499

i.e. the same kind of numbers with decimal digits after a comma. I wrote

rr=r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?\s"

It doesn't work the way I expect. For instance, if I run

sentence = "jsjs 2 222,11 dhd"

re.findall(rr, sentence)

[('2 222', ',11')]

Any help appreciated, thanks.

Community
  • If that isn't the output you wanted, what output _did_ you want? Because that seems like exactly what you should be looking for (except maybe moving the `,` outside of the capture group). – abarnert Oct 23 '14 at 22:51
  • I got two tokens, "2 222" and ",11". The answer should be "2 222,11". :) – Dogacan Sbdb Oct 23 '14 at 23:10
  • Please edit that into your question, not just in a comment. As it stands, your question—which is the only thing people will see if they're searching for someone to help or for help with a similar problem—doesn't make it clear what you're asking. – abarnert Oct 24 '14 at 00:12

3 Answers


This works:

import re

rr=r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)"
sentence = "jsjs 2 222,11 dhd"

print(re.findall(rr, sentence))  # prints ['2 222,11']
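
As a quick check, the same pattern can be run against the sample strings from the question. This is only a sketch: the names NUMBER_RE and samples are illustrative and not part of the original answer.

```python
import re

# Pattern from this answer: one capturing group around the whole number,
# with the decimal part in an optional non-capturing group.
NUMBER_RE = re.compile(r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)")

# Sample strings copied from the question.
samples = [
    "1 100",
    "1 567 865",
    "1 474 388 346",
    "1 100,34848",
    "1 100 300,8",
    "19 328 383 334,23499",
]

for text in samples:
    # Each sample should come back as a single, complete match.
    print(repr(text), "->", NUMBER_RE.findall(text))
```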
friedi

TL;DR: This regular expression will print ['2 222,11']:

r"(?:\d{1,3}(?:\s*\d{3})*)(?:,\d+)?"

The result of re.findall is the text captured by each parenthesized group, except groups starting with (?:, or the whole match if there aren't any capturing groups.

So your first regex matches your string and returns the whole number, since its only capturing group wraps the entire pattern (the inner parentheses start with (?:).

The second regex finds the string 2 222,11 and matches it, then looks at the capturing groups, (\d{1,3}(?:\s*\d{3})*) and (,\d+), and returns a tuple containing those: namely the part before the decimal comma and the part after it.

So to fix your expression, you need to either add ?: to every group or remove the parentheses where they aren't needed for grouping.

Also, the final \s is redundant: quantifiers are greedy by default and match as many characters as possible, so \d+ already matches all the digits after the comma.
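
A small sketch of how the number of capturing groups changes what re.findall returns, using the sentence from the question. The variable names are illustrative; the three patterns differ only in their capturing groups.

```python
import re

sentence = "jsjs 2 222,11 dhd"

# Two capturing groups: findall returns one tuple per match,
# with one item per group.
two_groups = re.findall(r"(\d{1,3}(?:\s*\d{3})*)(,\d+)?", sentence)

# One capturing group around everything: findall returns that group's text.
one_group = re.findall(r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)", sentence)

# No capturing groups at all: findall returns the whole match.
no_groups = re.findall(r"\d{1,3}(?:\s*\d{3})*(?:,\d+)?", sentence)

print(two_groups)  # [('2 222', ',11')]
print(one_group)   # ['2 222,11']
print(no_groups)   # ['2 222,11']
```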

Roukanken

The only problem with your result is that you're getting two match groups instead of one. The only reason that's happening is that you're creating two capture groups instead of one. You're putting separate parentheses around the first half and the second half, and that's what parentheses mean. Just don't do that, and you won't have that problem.

So, with this, you're half-way there:

(\d{1,3}(?:\s*\d{3})*,\d+)\s


The only problem is that the ,\d+ part is now mandatory instead of optional. You obviously need somewhere to put the ?, as you were doing. But without a group, how do you do that? Simple: you can use a group, just make it a non-capturing group ((?:…) instead of (…)). And put it inside the main capturing group, not separate from it. Exactly as you're already doing for the repeated \s*\d{3} part.

(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)\s
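
One caveat worth checking, offered as a hedged sketch and not as part of the original answer: because the trailing \s must match a whitespace character, a number at the very end of the string is no longer matched in full. The string text below is a made-up example, and the variable names are illustrative.

```python
import re

with_space = r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)\s"
without_space = r"(\d{1,3}(?:\s*\d{3})*(?:,\d+)?)"

# Hypothetical input where the number is the last thing in the string.
text = "total 2 222,11"

# With the trailing \s, the engine backtracks until it finds a digit run
# that is followed by whitespace, so only the leading "2" is captured.
print(re.findall(with_space, text))     # ['2']

# Without it, the whole number is captured.
print(re.findall(without_space, text))  # ['2 222,11']
```

If the input can end with a number, dropping the trailing \s avoids this.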


abarnert