0

I have some HTML in which I want to find a string containing a comma-separated number, like

871,174 Views (the number can range from 1 to n, with commas in it)

I tried many patterns, for example

'(\d+(,d+)*)\sViews'

but I can't make it work: when I run

re.findall(r'(\d+(,d+)*)\sViews', string)

it gives

[('174', '')]

What I actually want is the whole number.

Edit 1: this is the string I'm passing to the regex

<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>
vaibhav1312
  • [Don't use regex for this](http://stackoverflow.com/a/1732454/1519058)... It is more appropriate to use a dom parser like [`BeautifulSoup`](http://www.crummy.com/software/BeautifulSoup/) – Enissay Jan 23 '15 at 10:19
  • Enissay, I'm using BeautifulSoup right now, but it takes more time than a regular expression, so I wanted to switch to regex – vaibhav1312 Jan 23 '15 at 10:21
  • At least get the text from that node using BS, then use a regex to extract the part you want – Enissay Jan 23 '15 at 10:23
  • `re.findall("\d+",soup.find(attrs={"class":"fcg"},text=re.compile("\d+")).text)` – Padraic Cunningham Jan 23 '15 at 10:28

4 Answers

2

Unless it is a typo, you've omitted the backslash before the second `d`:

  '(\d+)(,\d+)*\sViews'
# here ___^

Test:

>>> html = """<span class="fcg">871,174 Views</span>"""
>>> import re
>>> pattern = re.compile(r'(\d+)(?:,(\d+))*\sViews')
>>> matches = re.findall(pattern, html)
>>> print(matches)
[('871', '174')]
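
One caveat worth illustrating: a capturing group that repeats keeps only its last repetition, so a count with more than one comma loses its middle digits. The sketch below uses a made-up count of 1,871,174 (not from the question) to show the difference between two capturing groups and a single group wrapped around the whole number.

```python
import re

# Hypothetical input with two commas, to exercise the repeated group.
html = '<span class="fcg">1,871,174 Views</span>'

# The second group repeats, so only its final repetition is remembered.
per_group = re.findall(r'(\d+)(?:,(\d+))*\sViews', html)
print(per_group)  # [('1', '174')]

# A single capturing group around the whole number keeps every part.
whole = re.findall(r'(\d+(?:,\d+)*)\sViews', html)
print(whole)  # ['1,871,174']
```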
Toto
0
(\d+(?:,\d+)*)

Try this. It should work for you: the inner group is non-capturing, so `re.findall` returns the whole number instead of a tuple.
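
As a minimal sketch of how a single-group pattern could be used to get an actual integer, assuming the count is always followed by whitespace and the word "Views": the helper name `parse_view_count` and the constant `VIEWS_RE` are made up for this illustration.

```python
import re

# One capturing group around the whole number; the inner group is
# non-capturing and has a backslash before each d.
VIEWS_RE = re.compile(r'(\d+(?:,\d+)*)\sViews')


def parse_view_count(text):
    """Return the view count as an int, or None if no count is found."""
    match = VIEWS_RE.search(text)
    if match is None:
        return None
    # Drop the thousands separators before converting.
    return int(match.group(1).replace(',', ''))


print(parse_view_count('<span class="fcg">871,174 Views</span>'))  # 871174
```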

vks
0

If you don't want to extract the text with BeautifulSoup and are going to use re, don't search the whole string. Use rsplit on the class attribute first; if you are worried about speed, this will be much faster:

html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>"""

import re
# Search only the text after the last class="fcg"> marker.
print(re.findall(r"\d+", html.rsplit('class="fcg">', 1)[1]))
['871', '174']

In [13]: timeit re.findall(("\d+"),html.rsplit('class="fcg">',1)[1])
100000 loops, best of 3: 3.21 µs per loop

In [14]: timeit matches = re.findall(pattern, html)
10000 loops, best of 3: 20.1 µs per loop

This is about as likely to break as any regex, which is why you should be using BeautifulSoup.
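
One concrete way the split approach can break: indexing `[1]` on the result of `rsplit` raises `IndexError` when the marker is absent. A slightly more defensive sketch, assuming the count sits immediately after the last `class="fcg">` marker, could use `str.rpartition` instead. The function name `digits_after_last_marker` is hypothetical.

```python
import re


def digits_after_last_marker(html, marker='class="fcg">'):
    """Return the comma-separated number right after the last marker, or None."""
    # rpartition returns (head, separator, tail); the separator is an
    # empty string when the marker does not occur in the input.
    _, sep, tail = html.rpartition(marker)
    if not sep:
        return None
    # re.match anchors at the start of the tail, so only a number that
    # immediately follows the marker is accepted.
    match = re.match(r'\d+(?:,\d+)*', tail)
    return match.group(0) if match else None


sample = ('<span class="fcg"><span id="x"></span></span>'
          '<span class="fcg">871,174 Views</span>')
print(digits_after_last_marker(sample))            # 871,174
print(digits_after_last_marker('<p>nothing</p>'))  # None
```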

Padraic Cunningham
  • 176,452
0
import re

html = """<span class="fcg"><span id="fbPhotoPageCreatorInfo"></span></span><div class="mbs fbPhotosAudienceContainerNotEditable" id="fbPhotoPageAudienceSelector"><span class="mrs fbPhotosAudienceNotEditable fsm fwn fcg">Shared with:</span><div class="_6a _29ee _3iio _20nn _43_1" data-hover="tooltip" aria-label="Public" data-tooltip-alignh="center"><i class="img sp_e0NUBoHLxu_ sx_9486cc"></i><span class="_29ef">Public</span></div>&nbsp;</div><div></div><span class="fcg">871,174 Views</span>"""

# Match digits and commas only when followed by whitespace and "Views".
p = re.compile(r"[\d,]+(?=\sViews)")
print(p.findall(html))
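
A character class such as `[\d,]+` is permissive: it also accepts runs made only of commas. The sketch below compares it with a stricter lookahead pattern that requires digit groups separated by single commas. The malformed `odd` input is made up purely for illustration.

```python
import re

# Permissive: any run of digits and commas before " Views".
loose = re.compile(r"[\d,]+(?=\sViews)")
# Stricter: digit groups separated by single commas before " Views".
strict = re.compile(r"\d+(?:,\d+)*(?=\sViews)")

good = '<span class="fcg">871,174 Views</span>'
odd = '<span class="fcg">,, Views</span>'  # hypothetical malformed input

print(loose.findall(good), strict.findall(good))  # ['871,174'] ['871,174']
print(loose.findall(odd), strict.findall(odd))    # [',,'] []
```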
HxGRD