Regular expression to find comma-separated numbers in Python

Question

I have some HTML in which I want to find a string that contains a comma-separated number, like

871,174 Views (this could range from 1 to n with commas in it)

I tried many patterns, for example

'(\d+(,d+)*)\sViews'

but I can't make it work. When I run

re.findall(r'(\d+(,d+)*)\sViews', string)

it gives

[('174', '')]

What I actually want is the whole number.

Edit 1: this is the string I'm passing to regex

<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>

[Don't use regex for this](http://stackoverflow.com/a/1732454/1519058)... It is more appropriate to use a dom parser like [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/) — Enissay, Jan 23 '15 at 10:19
@Enissay I'm using BeautifulSoup right now, but it takes more time than a regular expression, so I wanted to switch to regex — vaibhav1312, Jan 23 '15 at 10:21
At least get the text from that node using BS, then use a regex to extract the part you want — Enissay, Jan 23 '15 at 10:23
`re.findall("\d+",soup.find(attrs={"class":"fcg"},text=re.compile("\d+")).text)` — Padraic Cunningham, Jan 23 '15 at 10:28

Toto · Answer 1 · 2015-01-23T11:04:18.630

2

Unless it is a typo, you've omitted the backslash before the second d, so ,d+ matches a literal "d" instead of digits:

  '(\d+)(,\d+)*\sViews'
# here ___^

Test:

>>> html = """<span class="fcg">871,174 Views</span>"""
>>> import re
>>> pattern = re.compile(r'(\d+)(?:,(\d+))*\sViews')
>>> matches = re.findall(pattern, html)
>>> print(matches)
[('871', '174')]
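
One caveat with the test above: a repeated capturing group keeps only its last repetition, so a number with more than one comma loses its middle groups. The sketch below illustrates this and shows one possible alternative, a single capturing group around the whole number with a non-capturing repetition inside. The sample string `1,871,174 Views` and the variable names are made up for illustration and are not from the question.

```python
import re

# Hypothetical sample with two commas, to show the difference.
html = '<span class="fcg">1,871,174 Views</span>'

# Repeated capturing group: only the last repetition ("174") is kept.
last_only = re.findall(r'(\d+)(?:,(\d+))*\sViews', html)
print(last_only)  # [('1', '174')]

# One capturing group around the whole number; the repetition is non-capturing.
whole = re.findall(r'(\d+(?:,\d+)*)\sViews', html)
print(whole)  # ['1,871,174']
```

With a single group in the pattern, `re.findall` returns plain strings rather than tuples, which is usually easier to work with.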

edited Jan 23 '15 at 11:04

answered Jan 23 '15 at 10:23


score 0 · Answer 2 · answered Jan 23 '15 at 10:09


(\d+(?:,\d+)*)

Try this. It should work for you.


vks


Padraic Cunningham · Answer 3 · 2015-01-23T11:08:58.997

If you don't want to get the text with BeautifulSoup and are going to use re, don't search the whole string. rsplit on the class first, which will be much faster if you are worried about speed:

html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>"""

import re
print(re.findall(r"\d+", html.rsplit('class="fcg">', 1)[1]))
['871', '174']

In [13]: timeit re.findall(("\d+"),html.rsplit('class="fcg">',1)[1])
100000 loops, best of 3: 3.21 µs per loop

In [14]: timeit matches = re.findall(pattern, html)
10000 loops, best of 3: 20.1 µs per loop
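
The timings above come from IPython's `timeit` magic. Outside IPython, a rough equivalent with the standard `timeit` module might look like the sketch below. The shortened `html` string and the function names are assumptions made for illustration, and the numbers will differ by machine and input, so no expected timings are shown.

```python
import re
import timeit

# Shortened stand-in for the HTML in the question (assumption for illustration).
html = '<div></div><span class="fcg">871,174 Views</span>'
pattern = re.compile(r'(\d+)(?:,(\d+))*\sViews')

def split_then_search():
    # Cut the string at the last class="fcg"> and search only the tail.
    return re.findall(r'\d+', html.rsplit('class="fcg">', 1)[1])

def search_whole_string():
    # Run the compiled pattern over the entire string.
    return pattern.findall(html)

runs = 10000
for func in (split_then_search, search_whole_string):
    seconds = timeit.timeit(func, number=runs)
    print(f"{func.__name__}: {seconds / runs * 1e6:.2f} µs per call")
```

The gap between the two approaches depends on how much markup precedes the number, so it is worth measuring on the real page rather than on a short sample.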

This has about the same chance of breaking as any regex, which is why you should be using BeautifulSoup.

score 0 · Answer 4 · answered Jan 25 '15 at 04:05

import re

html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>"""

# Match a run of digits and commas only when it is followed by " Views".
p = re.compile(r"[\d,]+(?=\sViews)")
print(p.findall(html))  # ['871,174']
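
If the goal is a numeric value rather than the matched text, the commas can be stripped before converting. This is a minimal sketch, assuming the count always uses commas as thousands separators. It also uses a slightly stricter pattern than the one above, so that a stray comma in front of " Views" cannot match on its own. The shortened `html` string and the name `counts` are illustrative.

```python
import re

# Shortened stand-in for the HTML in the question (assumption for illustration).
html = '<span class="fcg">871,174 Views</span>'

# Digits, optionally followed by comma-separated digit groups, before " Views".
p = re.compile(r"\d+(?:,\d+)*(?=\sViews)")

# Drop the commas and convert each match to an integer.
counts = [int(m.replace(",", "")) for m in p.findall(html)]
print(counts)  # [871174]
```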
