How can I match the price in this string?

    <div id="price_amount" itemprop="price" class="h1 text-special">
      $58
    </div>

I want to extract the $58 from this string. How can I do that? This is what I am trying, but it doesn't work:

    regex = r'<div id="price_amount" itemprop="price" class="h1 text-special">(.+?)</div>'
    price = re.findall(regex, string)
Liao Zhuodi
  • Refer to the answer [here](http://stackoverflow.com/questions/849912/python-regex-how-to-find-a-string-between-two-sets-of-strings) – Sawal Maskey Jun 11 '14 at 06:07

2 Answers

You really should not use regex for this particular problem. Look into an XML/HTML parsing library for Python instead.

Having said that, your regex fails only because . does not match newlines by default, so it cannot get past the line breaks around the price. Add \s* after the opening tag and before the closing tag to absorb that whitespace.

    import re

    string = """
        <div id="price_amount" itemprop="price" class="h1 text-special">
          $58
        </div>
        """
    # \s* on either side of the group absorbs the newlines and indentation
    regex = r'<div id="price_amount" itemprop="price" class="h1 text-special">\s*(.+?)\s*</div>'
    price = re.findall(regex, string)
    print(price)  # ['$58']
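
For the parser route recommended at the top of this answer, here is a minimal sketch using only the standard library's html.parser module (Python 3). The class name PriceParser and the helper extract_price are made up for illustration, and the sketch assumes the price element contains no void tags such as <br>, which would throw off the simple depth counter.

```python
from html.parser import HTMLParser


class PriceParser(HTMLParser):
    """Collect the text inside the element whose id is "price_amount".

    Hypothetical helper for illustration; assumes no void tags (e.g. <br>)
    appear inside the target element.
    """

    def __init__(self):
        super().__init__()
        self._depth = 0    # greater than 0 while inside the target element
        self.chunks = []   # text fragments found inside the target element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested tag inside the target element
            self._depth += 1
        elif dict(attrs).get("id") == "price_amount":
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.chunks.append(data)


def extract_price(html):
    """Return the stripped text of the price element, or '' if absent."""
    parser = PriceParser()
    parser.feed(html)
    parser.close()
    return "".join(parser.chunks).strip()


html = """
    <div id="price_amount" itemprop="price" class="h1 text-special">
      $58
    </div>
"""
print(extract_price(html))  # $58
```

Unlike the regex, this does not depend on the order of the attributes or on the exact whitespace inside the tag. A third-party parser would likely make this shorter, but the sketch sticks to the standard library to avoid assuming anything is installed.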
merlin2011
  • You might want to use the non-greedy versions – thefourtheye Jun 11 '14 at 06:04
  • @thefourtheye, Actually, then you would match a whole bunch of extra whitespace inside the capture, which I assume the OP doesn't want. – merlin2011 Jun 11 '14 at 06:08
  • Is the reason to use an XML/HTML parser that it is more accurate and faster? – Liao Zhuodi Jun 11 '14 at 06:27
  • @liaozd, It is faster, more reliable, and generally not as much of a headache. Regular expressions were not designed to parse XML. – merlin2011 Jun 11 '14 at 06:30
  • @liaozd, [Here](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags) is some entertaining and also good reference material on the matter. – merlin2011 Jun 11 '14 at 06:31
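
The greedy versus non-greedy point raised in the comments can be checked directly. The sketch below is only an illustration: it shortens the opening tag to a single id attribute (an assumption made to keep the patterns readable) and compares \s* with its lazy form \s*? around the same capture group.

```python
import re

# Shortened tag (assumption for readability); same whitespace layout as the question
html = '<div id="price_amount">\n      $58\n    </div>'

greedy = r'<div id="price_amount">\s*(.+?)\s*</div>'
lazy = r'<div id="price_amount">\s*?(.+?)\s*?</div>'

greedy_match = re.findall(greedy, html)
lazy_match = re.findall(lazy, html)

print(greedy_match)  # ['$58']
print(lazy_match)    # ['      $58']
```

With the lazy \s*?, the leading \s*? consumes only the newline it is forced to take, so the indentation ends up inside the capture group.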
Try capturing only the price between the <div> and </div> tags:

    import re

    # Note: this sample has no newlines between the tags and the price
    string = ('<div id="price_amount" itemprop="price" class="h1 text-special">'
              '$58'
              '</div>')
    regex = r'<div id="price_amount" itemprop="price" class="h1 text-special">([^<]*?)</div>'
    price = re.search(regex, string)
    price.group(1)  # => '$58'

([^<]*?) matches any character other than < zero or more times and stores the matched text in a group (group 1). The ? following * makes the match non-greedy.
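
One thing worth knowing about this pattern: a negated class such as [^<] also matches newlines, so on the multi-line input from the question the group picks up the surrounding whitespace too. A small sketch of that, assuming a shortened opening tag for readability, with str.strip() used to clean the result:

```python
import re

# Shortened tag (assumption for readability); same whitespace layout as the question
html = '<div id="price_amount">\n      $58\n    </div>'

match = re.search(r'<div id="price_amount">([^<]*?)</div>', html)
raw = match.group(1)   # includes the newlines and indentation
price = raw.strip()    # drop the surrounding whitespace

print(repr(raw))  # '\n      $58\n    '
print(price)      # $58
```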

Avinash Raj