I need help extracting a price from some HTML. I have already extracted the title of a movie, and now I need to extract its price. I tried a lookbehind regular expression, but using \n.* inside it gives the error "A quantifier inside a lookbehind makes it non-fixed width". I need the first and the second price in the text.

Regex I have tried:

(?<=Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)

and:

Hello<\/a>.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*\n.*?(\$)

But neither works.

Text:

<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
    <div class="hi">
        <p>Including <a class="blue_link"> 
<p>Price$<data1>40.00</p>

Please help, and thank you :)

Kevin
  • Is your expected output `$30.53` and `$27.46`? – akash karothiya May 18 '17 at 06:01
  • If you want to parse HTML, use an HTML parser. Regex is not an HTML parser and should not be used for parsing HTML. See http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags for more information. – Fabian S. May 18 '17 at 06:57
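
The parser approach suggested in the comment above could look roughly like the sketch below, which uses only Python's standard-library `html.parser`. The `PriceCollector` class name and the shortened `snippet` are illustrative assumptions, not part of the original page; the two prices are the ones mentioned in the first comment. A small regex is still used, but only on text nodes the parser has already separated from the tags.

```python
import re
from html.parser import HTMLParser


class PriceCollector(HTMLParser):
    """Hypothetical helper: collects every $-prefixed amount in text nodes."""

    def __init__(self):
        super().__init__()
        self.prices = []

    def handle_data(self, data):
        # Called with the text between tags; tags themselves never reach here.
        self.prices.extend(re.findall(r"\$(\d+\.\d+)", data))


# Shortened, assumed markup containing the two prices from the comment above.
snippet = ('<s>$30.53</s>&nbsp;&nbsp;'
           '<span class="productSpecialPrice"><b>$27.46</b></span>')

collector = PriceCollector()
collector.feed(snippet)
collector.close()
print(collector.prices)  # ['30.53', '27.46']
```

Because the parser handles tags and entities, the price pattern no longer has to account for line breaks or the markup between the title and the price.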

1 Answer

You can use this regex with the DOTALL flag, which lets . match newlines so the pattern can span several lines:

import re

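# Raw string so the backslashes reach the regex engine unchanged;
# lazy .+? stops at the nearest price instead of the last one on the page.
r = r"The Durrells: Series 2.+?\$(\d+\.\d+).+?\$(\d+\.\d+)"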

text = ''' <a class="blue_link fn url" href="http://www.fishpond.com.au/Movies/Durrells-Series-2-Keeley-Hawes/5014138609450">The Durrells: Series 2</a>
    <div class="by">
        <p>Starring <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Keeley+Hawes">Keeley Hawes</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Milo+Parker">Milo Parker</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Josh+O%27Connor">Josh O'Connor</a>, <a class="blue_link" href="http://www.fishpond.com.au/c/Movies/s/Daisy+Waterstone">Daisy Wat...</a></p>
        <div class="productSearch-metainfo">
DVD (UK), May 2017        </div>
    </div>
</div></td>
                    <td align="right" style="vertical-align:top;"><div class="productSearch-price-container">
<span class="rrp-label">Elsewhere</span>&nbsp;<s>$30.53</s>&nbsp;&nbsp;<span class="productSpecialPrice"><b>$27.46</b></span>&nbsp;<div style="white-space:nowrap;">&nbsp; &nbsp;<span class="you_save">Save 10%</span>&nbsp;</div><span class="free-shipping">with Free Shipping!</span></div>
'''

print(re.findall(r, text, re.DOTALL))

Output:

[('30.53', '27.46')]
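
As a hedged illustration of why the lookbehind attempt in the question fails, and how the same capture-group idea applies to the snippet posted there: Python's `re` module rejects a lookbehind that contains `.*`, because it cannot have a fixed width. Anchoring on the title and capturing the number avoids the lookbehind entirely. The optional `(?:<[^>]*>)?` part is an assumption made to skip the `<data1>` tag that sits between `$` and the digits in the posted sample.

```python
import re

# Sample markup as posted in the question.
html = '''<a class="blue_link" href="http://www.ebgames.com.au/Games/sjbeiub108723">Hello:</a>
    <div class="hi">
        <p>Including <a class="blue_link">
<p>Price$<data1>40.00</p>
'''

# A lookbehind containing ".*" has no fixed width, so compiling it fails.
try:
    re.compile(r"(?<=Hello:</a>.*)\$")
    lookbehind_ok = True
except re.error:
    lookbehind_ok = False

# Anchor on the title, skip lazily to the first "$", then capture the number.
# "(?:<[^>]*>)?" tolerates one optional tag between "$" and the digits.
pattern = re.compile(r"Hello:</a>.*?\$(?:<[^>]*>)?(\d+\.\d+)", re.DOTALL)
match = pattern.search(html)
price = match.group(1) if match else None

print(lookbehind_ok)  # False
print(price)          # 40.00
```

To get a second price as well, the capturing part can be repeated after another lazy `.*?`, as in the regex above.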
Taku