Can you please provide me with a regular expression that outputs a single result from a duplicated price? It could be generic, meaning it should work for anything within the HTML tags, not just the price.

For example, this is the rule to scrape the price:

<strong class="bigprice">(.+?)</strong>

Output of rule: "£4.99" "£4.99" (as you can see, the result is duplicated because the source code contains two identical tags, each holding the same value).

I only want the first result to show, not both. Is there any way of doing this with regular expressions?

Toto
  • Can there be lines such as `£1.00 £2.00 £2.00 £1.00 £1.00`? If yes, regexes are pretty much a lost cause... – fge Jun 11 '13 at 13:51
  • Also, what language are you using? And since this is HTML, have you seen [this](http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags)? – fge Jun 11 '13 at 13:53
  • Since you mention that it's HTML, why not use an HTML DOM parser? In what language are you writing your code? – cb0 Jun 11 '13 at 13:56
  • We're here to help you write your own code, not write it for you. – Dan Pichelman Jun 11 '13 at 14:08

1 Answer

1

Description

Given a space-delimited list of values like £1.00 £2.00 £2.00 £1.00 £1.00, you can detect duplicates by using a negative lookahead that contains a backreference to the captured value. I also added a $ sign alongside the £ sign in the alternation to allow for multiple currency types (it is escaped as \$, because a bare $ is an end-of-line anchor). The expression returns the last instance of each value, which essentially makes the output unique.

Regex: (?:\s|^)((?:£|\$|\xC2|\xA3)\d+\.\d{1,2})(?=\s|$)(?!.*?\s\1(?=\s|$))

Input: £1.00 £2.00 £2.00 £1.00 £1.00

$matches Array:
(
    [0] => Array
        (
            [0] =>  £2.00
            [1] =>  £1.00
        )

    [1] => Array
        (
            [0] => £2.00
            [1] => £1.00
        )

)
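
The same idea can be tried outside PHP. The following is a minimal sketch, assuming Python's `re` module, which the original answer does not use; the name `last_unique_prices` is hypothetical and the currency group is simplified to `£` and an escaped `\$`.

```python
import re

# Assumption: Python's re module stands in for the PHP engine used above.
# The negative lookahead rejects a price if the same text (\1) occurs
# again later in the string, so only the last copy of each value matches.
UNIQUE_PRICE = re.compile(
    r'(?:\s|^)((?:£|\$)\d+\.\d{1,2})(?=\s|$)(?!.*?\s\1(?=\s|$))'
)

def last_unique_prices(text):
    """Return each distinct price once, at the position of its last occurrence."""
    return UNIQUE_PRICE.findall(text)

print(last_unique_prices("£1.00 £2.00 £2.00 £1.00 £1.00"))
# prints ['£2.00', '£1.00']
```

The order of the result follows the last occurrences, which is why £2.00 comes before £1.00 here, just as in the `$matches` array above.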

We can carry this idea a step further and apply it to your test expression <strong class="bigprice">(.+?)</strong>, so that a duplicate captured value is rejected. Since this looks like HTML, I'm going to replace .+?, which matches any characters, with a more restrictive capture. [^<]* would match everything up to the next opening angle bracket; the expression below uses the stricter price pattern from the first example.

Regex: (?:<strong\s(?=[^>]*class="bigprice")[^>]*>)\s*((?:£|\$|\xC2|\xA3)\d+\.\d{1,2})\s*<\/strong>(?!.*?(?:<strong\s(?=[^>]*class="bigprice")[^>]*>)\s*\1\s*<\/strong>)

Input: <strong class="bigprice">£1.00</strong><strong class="bigprice">£2.00</strong><strong class="bigprice">£1.00</strong>

$matches Array:
(
    [0] => Array
        (
            [0] => <strong class="bigprice">£2.00</strong>
            [1] => <strong class="bigprice">£1.00</strong>
        )

    [1] => Array
        (
            [0] => £2.00
            [1] => £1.00
        )

)
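
As a rough check of the tag-based expression, here is a sketch under the same assumption (Python's `re` module, not the engine used in the answer). The function name `last_unique_tagged_prices` is hypothetical, and `re.DOTALL` is an added assumption so that the lookahead can see duplicates on later lines of a multi-line page.

```python
import re

# Assumption: Python's re module; currency group simplified to £ and \$.
# The lookahead inside the opening tag requires class="bigprice"; the
# trailing negative lookahead rejects a tag whose price appears again
# in a later bigprice tag.
TAGGED_PRICE = re.compile(
    r'<strong\s(?=[^>]*class="bigprice")[^>]*>\s*'
    r'((?:£|\$)\d+\.\d{1,2})\s*</strong>'
    r'(?!.*?<strong\s(?=[^>]*class="bigprice")[^>]*>\s*\1\s*</strong>)',
    re.DOTALL,  # let "." cross newlines in multi-line HTML
)

def last_unique_tagged_prices(html):
    """Return each distinct bigprice value once (its last occurrence)."""
    return TAGGED_PRICE.findall(html)

html = ('<strong class="bigprice">£1.00</strong>'
        '<strong class="bigprice">£2.00</strong>'
        '<strong class="bigprice">£1.00</strong>')
print(last_unique_tagged_prices(html))
# prints ['£2.00', '£1.00']
```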

Summary

In both cases, the match fails at any value that appears again later in the input text, so only the last occurrence of each value is returned.
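
If the first occurrence of each value is wanted instead, as the question asks, one option is to leave the original rule alone and drop repeats in code after matching. This is only a sketch, assuming the matches can be post-processed in Python; the sample markup and the name `first_unique_matches` are hypothetical.

```python
import re

# The asker's original rule, unchanged.
PRICE_RULE = re.compile(r'<strong class="bigprice">(.+?)</strong>')

def first_unique_matches(html):
    """Collect every match, then keep only the first copy of each value."""
    matches = PRICE_RULE.findall(html)
    # dict.fromkeys keeps insertion order (Python 3.7+), so the first
    # occurrence of each value survives and later repeats are dropped.
    return list(dict.fromkeys(matches))

# Hypothetical sample: the same price rendered in two identical tags.
sample = ('<strong class="bigprice">£4.99</strong> '
          '<strong class="bigprice">£4.99</strong>')
print(first_unique_matches(sample))
# prints ['£4.99']
```

Unlike the lookahead approach, this keeps values in the order they first appear, and the regular expression itself stays simple.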

Ro Yo Mi
  • Nice! Where did you get the graphs from? – fge Jun 11 '13 at 15:59
  • Also, why the lazy quantifier in the lookahead? AFAICS, a normal, greedy quantifier will work equally well... – fge Jun 11 '13 at 16:00
  • @fge, I'm using debuggex.com. Although it doesn't support lookbehinds or atomic groups, it's still handy for understanding the expression flow. There is also regexper.com. They do a pretty good job too, but it's not real time as you're typing. – Ro Yo Mi Jun 11 '13 at 16:00
  • Hmm, if lookbehinds are not supported, it probably means this is a JavaScript library, since JS regexes have no support for lookbehinds... – fge Jun 11 '13 at 16:01
  • The lazy quantifier forces the expression to stop at the first match instead of looking for the last match in the string. I'm sure both lazy and non-lazy will work; it's just personal preference and makes more sense in my head. – Ro Yo Mi Jun 11 '13 at 16:01
  • Yeah I'm sure it's a JS system, I've talked to the dev/owner and he does have grand plans on expanding into other languages. – Ro Yo Mi Jun 11 '13 at 16:03
  • Heh, I have in the corner of my head a multilanguage regex analyzer written in Java using Parboiled... – fge Jun 11 '13 at 16:04
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/31597/discussion-between-denomales-and-fge) – Ro Yo Mi Jun 11 '13 at 16:05
  • I tried both rules and they don't work; they still show duplicate values. Is there any way to show only the first value and not the second, even though there are duplicate values? I'm using HTML and normal regular expressions. For example, if I use the old regex rule for price, <strong class="bigprice">(.+?)<\/strong>, the result will be £4.99 and £4.99 again. I only want to capture £4.99 once, not twice. Is there a regex for this? Please let me know. Hope this clears things up. – Paul Lamptey Jun 18 '13 at 10:00
  • Can you include some sample text, or ideally the exact text being tested? – Ro Yo Mi Jun 18 '13 at 12:10
  • I updated the expressions to capture the character codes \xC2 and \xA3, and expanded the open and close tag captures to be a bit more robust. The answer also now includes sample input and results. – Ro Yo Mi Jun 18 '13 at 12:48
  • Link to the site I'm scraping: http://www.lovell-rugby.co.uk/Rugby-Accessories/Kinetica/Kinetica-Large-Clear-Shaker-Bottle Rule for scraping Price: <strong class="bigprice">(.+?)<\/strong> You need to use Chrome and press F12 to see the HTML source for Price, then use the select element tool and click on Price "4.99". Download Regex Scraper 1.1.1 and insert the rule above. You will see that Price is duplicated. I only want to display the first duplicated price and not both. Is there a regex rule for this? Please let me know. – Paul Lamptey Jun 18 '13 at 12:55
  • This goes for Product Code and Brand. Thanks – Paul Lamptey Jun 18 '13 at 12:56
  • Whereas the example text works, your example page encodes the monetary symbol as an HTML entity (e.g. `&pound;`) and not as a literal `£`. You can add that entity string as another alternative in the currency group and it'll work as expected. See link: http://www.rubular.com/r/pplEk6I5rT (scroll all the way to the bottom to see the matches) – Ro Yo Mi Jun 18 '13 at 12:57
  • OK, but it is still showing multiple prices in the scraper, so I'm not getting it. Have you tried this in the scraper in Chrome? It doesn't work there. Thanks – Paul Lamptey Jun 18 '13 at 13:49
  • Your issue is suffering from scope creep. I don't know anything about the scraper in Chrome. – Ro Yo Mi Jun 18 '13 at 15:02