Python finding regex in a String

Question

I'm trying to find all occurrences of money values in a string called webpage.

The string webpage holds the text of this webpage. In my program it is just hardcoded, because that's all that is needed, so I won't paste it all here.

regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)

It returns [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn].

so, you should NOT parse a web-page with regex. There are other good proper tools — RomanPerekhrest, Dec 14 '17 at 14:33
My answer here (https://stackoverflow.com/a/37571199/2064981) might help you ;) — SamWhan, Dec 14 '17 at 14:33
There is _no way_ this regex will match anything as it matches only _at the beginning of the string_, because of `'^stuff'`. So it looks like you don't want to match _at the very beginning_ of the webpage. — ForceBru, Dec 14 '17 at 14:48
Your regex starts with the `^` anchor, which means it's only going to match a currency value that starts at the very beginning of the document. — glibdud, Dec 14 '17 at 14:48

JCJ · Accepted Answer · 2017-12-14T15:04:40.650

Without knowing the text it has to search, you could use the regex:

([€|$|£]+[0-9a-zA-Z\,\.]+)

to capture every run that starts with €, £ or $ and continues with letters, digits, commas or periods. Words separated from the amount by a space are left out. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.

Using this regex we get this code:

import re
webpage = '''
one 
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€|$]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)

with the output:

['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']

EDIT: Run against the text of the provided website, the same regex returns:

['$131bn', '$100bn', '$17.4bn.', '$52.4bn']
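Two details of that output are worth noting. £100bn is missing because the character class in the code above is `[€|$]`, without `£`, and `$17.4bn.` keeps its trailing full stop because `\.` is allowed anywhere in the second class. Inside a character class, `|` is a literal pipe, not alternation. A minimal sketch of a tighter pattern follows, assuming amounts are digits with optional thousands commas, an optional decimal part and an optional letter suffix. The `sample` text is made up for illustration and is not the real page.

```python
import re

# Hypothetical sample text, not the actual BBC page.
sample = "A $131bn deal, a £100bn bid, and sales of $17.4bn. Also €1,250.50 in fees."

# [€$£]            one currency symbol (no '|' needed inside a class)
# \d+(?:,\d{3})*   digits with optional thousands groups
# (?:\.\d+)?       optional decimal part, so a sentence-ending '.' is left out
# [a-zA-Z]*        optional suffix glued to the number, e.g. 'bn'
pattern = re.compile(r'[€$£]\d+(?:,\d{3})*(?:\.\d+)?[a-zA-Z]*')

amounts = pattern.findall(sample)
print(amounts)  # ['$131bn', '£100bn', '$17.4bn', '€1,250.50']
```

Whether the letter suffix should be restricted to known units such as `bn` or `mil` depends on the text being searched.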

To also find amounts without a currency symbol, e.g. 500million, add 0-9 to the first character class, so that a match may start with either a currency symbol or a digit.

For example:

webpage = '''
one
million
€1293.1205 million
500million
'''
regex = r'([€|$0-9]+[0-9a-zA-Z\,\.]+)'
res = re.findall(regex, webpage)
print(res)

prints:

['€1293.1205', '500million']

This works. If I wanted to be able to find something such as 500mil dollars, how would I adapt your regex? — Chaz, Dec 14 '17 at 15:00
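One possible adaptation for the case raised in this comment is sketched below. It assumes a magnitude such as `bn`, `mil` or `million` may follow the number, optionally followed by a currency word, and that a bare number only counts as money when at least one of those follows it. The word lists, the helper names `number`, `scale` and `word`, and the `sample` text are illustrative assumptions.

```python
import re

# Hypothetical sample text for illustration.
sample = "500mil dollars, $1bn, €1293.1205 million, 20 pounds and the year 2017"

number = r'\d+(?:[.,]\d+)*'                      # 500, 1293.1205, 1,000
scale = r'(?:\s?(?:bn|million|mil)\b)'           # optional space + magnitude
word = r'(?:\s?(?:dollars?|euros?|pounds?)\b)'   # optional space + currency word

# Either: symbol + number (+ magnitude) (+ word),
# or:     bare number that must be followed by a magnitude and/or a word.
pattern = re.compile(
    rf'[€$£]{number}{scale}?{word}?|{number}(?:{scale}{word}?|{word})'
)

found = pattern.findall(sample)
print(found)  # ['500mil dollars', '$1bn', '€1293.1205 million', '20 pounds']
```

Because every group is non-capturing, `findall` returns the full matched text. The plain number 2017 is skipped since nothing money-like follows it.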

score 0 · Answer 2 · answered Dec 14 '17 at 14:49

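The first error in your regex is the ^ at the beginning of the pattern. It anchors the match to the start of the string, so the pattern can only match an amount that begins at the very first character, which isn't helpful when using findall.

You are also defining a lot of capturing groups (()), which I assume you don't really need. When a pattern contains capturing groups, findall returns the groups rather than the whole match. Make them all non-capturing (by adding ?: right after the opening parenthesis) and you are going to get very close to what you want: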

regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
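The effect of both changes can be checked on a small string. The sketch below uses a deliberately shortened pattern (not the full one from the question) and a made-up `text`, purely to illustrate how `re.findall` behaves with `^` and with capturing groups.

```python
import re

# Hypothetical text; the first character is not a currency symbol.
text = "Shares rose after the $131bn bid and a £100bn offer."

# Anchored with ^: a match would have to start at index 0, where 'S' is.
anchored = re.findall(r'^[$£€]\d+(?:bn)?', text)

# Capturing groups: findall returns one tuple of groups per match.
capturing = re.findall(r'[$£€](\d+)(bn)?', text)

# Non-capturing groups: findall returns the whole matched text.
non_capturing = re.findall(r'[$£€](?:\d+)(?:bn)?', text)

print(anchored)       # []
print(capturing)      # [('131', 'bn'), ('100', 'bn')]
print(non_capturing)  # ['$131bn', '£100bn']
```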

score 0 · Answer 3 · answered Dec 14 '17 at 14:59

A webscraping solution:

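import itertools
import re
from urllib.request import urlopen
from bs4 import BeautifulSoup

# Python 3 version; the original answer used Python 2's urllib.urlopen.
html = urlopen('http://www.bbc.com/news/business-41779341').read()
s = BeautifulSoup(html, 'lxml')
# Collect every currency-like token from each <p> element into one flat list.
final_data = list(itertools.chain.from_iterable(
    re.findall(r'[€$£][\w.]+', p.text) for p in s.find_all('p')))
print(final_data)

Output reported in the original answer (shown in Python 3 notation):

['$131bn', '£100bn', '$100bn', '$17.4bn.']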
