1

I'm trying to find all cases of money values in a string called webpage.

The string webpage holds the text of this webpage. In my program it is simply hardcoded, because that is all I need, so I won't paste it all here.

regex = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)

It returns [], but I expected it to return [$131bn, £100bn, $100bn, $17.4bn].

Chaz

3 Answers

2

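Without knowing the text it has to search, you could use the regex:

([€$£]+[0-9a-zA-Z,.]+)

This captures every run that starts with €, £ or $ and continues with digits, letters, commas or full stops. A unit attached directly to the number (as in $1bn) is therefore included, while a word separated by a space (as in €1293.1205 million) is not. See the example in action here: http://rubular.com/r/a7O7AGF9Zl.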

Using this regex we get this code:

import re
webpage = '''
one 
million
dollars
test123
$1bn asd
€5euro
$1923,1204bn
€1293.1205 million'''
regex = r'([€$£]+[0-9a-zA-Z,.]+)'
res = re.findall(regex, webpage)
print(res)

with the output:

['$1bn', '€5euro', '$1923,1204bn', '€1293.1205']
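
A side note on the character class: inside [...] the symbols are simply listed, and | is a literal pipe rather than alternation, so a class written as [€|$|£] would also accept '|' as if it were a currency symbol. A minimal check, using a made-up sample string:

```python
import re

sample = 'a|5 b$5 c€7'  # made-up text containing a stray pipe

# '|' inside [...] is an ordinary character, so '|5' is matched too
with_pipe = re.findall(r'[€|$|£]+[0-9]+', sample)
# listing only the symbols gives the intended behaviour
symbols_only = re.findall(r'[€$£]+[0-9]+', sample)

print(with_pipe)     # ['|5', '$5', '€7']
print(symbols_only)  # ['$5', '€7']
```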

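EDIT: Run against the page from the question with only € and $ in the character class, it returned the output below. £100bn is absent, presumably because £ was not in the class for that run: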

['$131bn', '$100bn', '$17.4bn.', '$52.4bn']

To also find amounts without a currency symbol, e.g. 500million, you can add 0-9 to the first character class. The match may then start with £, €, $ or a digit. Be aware that this also matches any plain number of two or more digits.

For example, with this input and regex:

webpage = '''
one 
million
€1293.1205 million
500million
'''
regex = r'([€$£0-9]+[0-9a-zA-Z,.]+)'
res = re.findall(regex, webpage)
print(res)

the output becomes:

['€1293.1205', '500million']
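
One thing to watch: because the second character class contains '.', a match at the end of a sentence keeps the full stop, as in '$17.4bn.'. A possible tightening, offered only as a sketch and not tested against the real page, is to allow a separator only when digits follow it. The sample text below is made up:

```python
import re

sample = 'Profits of $131bn and £100bn, then $17.4bn. Also $1923,1204bn.'

# symbol, digits, optional (separator + digits) groups, optional attached unit letters
regex = r'[€$£]\d+(?:[.,]\d+)*[a-zA-Z]*'
res = re.findall(regex, sample)
print(res)  # ['$131bn', '£100bn', '$17.4bn', '$1923,1204bn']
```

The trade-off is that this version requires a leading currency symbol, so it will not pick up bare amounts such as 500million.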
JCJ
  • This works. If I wanted to find something such as 500mil dollars, how would I adapt your regex? – Chaz Dec 14 '17 at 15:00
  • I have updated my answer with a potential solution to that. – JCJ Dec 14 '17 at 15:05
0

The first error in your regex is the ^ at the beginning. It anchors the pattern to the start of the string, so findall can find at most one match, and only if the text begins with a money value.

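You are also defining a lot of capturing groups with ( ), and when a pattern contains groups, findall returns tuples of those groups instead of the whole match. I assume you don't really need them, so make them non-capturing (add ?: right after each opening parenthesis). Note that the pattern below also requires the leading currency symbol. With that you get very close to what you want: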

regex = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
res = re.findall(regex, webpage)
print(res)
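
To see both effects side by side, here is a small check on a made-up sentence (not the real page text), comparing the pattern from the question, the same pattern without ^, and the non-capturing version:

```python
import re

sample = 'Sales rose to $131bn, up from £100bn a year earlier.'  # made-up text

original = r'^[$£€]?(([\d]{1,3},([\d]{3},)*[\d]{3}|[0-9]+)(\.[0-9][0-9])?(\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'
no_anchor = original[1:]  # same pattern with the leading ^ removed
non_capturing = r'[$£€](?:(?:[\d]{1,3},(?:[\d]{3},)*[\d]{3}|[0-9]+)(?:\.[0-9][0-9])?(?:\s?bn|\s?mil|\s?euro[s]?|\s?dollar[s]?|\s?pound[s]?|p){0,2})'

anchored = re.findall(original, sample)
with_groups = re.findall(no_anchor, sample)
whole_matches = re.findall(non_capturing, sample)

print(anchored)       # [] because the text does not start with an amount
print(with_groups)    # a list of 5-tuples, one entry per capturing group
print(whole_matches)  # ['$131bn', '£100bn']
```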
eLRuLL
0

A web-scraping solution (Python 3, using the third-party beautifulsoup4 and lxml packages):

import re
import urllib.request
from bs4 import BeautifulSoup

# download the article and parse it
url = 'http://www.bbc.com/news/business-41779341'
s = BeautifulSoup(urllib.request.urlopen(url).read(), 'lxml')
# collect every currency-prefixed token from the text of each <p> element
final_data = [m for p in s.find_all('p') for m in re.findall(r'[€$£][\w.]+', p.text)]
print(final_data)

Output:

['$131bn', '£100bn', '$100bn', '$17.4bn.']
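
If you want to try the same idea without a network connection or extra packages, here is a rough sketch that swaps BeautifulSoup for the standard library's html.parser. The ParagraphText class and the HTML snippet are made up for illustration. As in the code above, only text inside <p> elements is searched:

```python
import re
from html.parser import HTMLParser

class ParagraphText(HTMLParser):
    """Hypothetical helper: collect the text found inside <p> elements."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # greater than 0 while inside a <p>
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag == 'p':
            self.depth += 1
            self.paragraphs.append('')

    def handle_endtag(self, tag):
        if tag == 'p' and self.depth:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.paragraphs[-1] += data

# made-up HTML; the amount in the heading is outside any <p>
html = ('<h1>Deal worth $52.4bn</h1><p>Sales hit $131bn.</p>'
        '<p>No figures here.</p><p>About £100bn more.</p>')
parser = ParagraphText()
parser.feed(html)
parser.close()

found = [m for text in parser.paragraphs
         for m in re.findall(r'[€$£][\w.]+', text)]
print(found)  # ['$131bn.', '£100bn']
```

The heading's amount is skipped because it is not inside a <p>, and the first match keeps its trailing full stop because '.' is part of the character class.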
Ajax1234