Regex "|" issues

Question

I am trying to get some data from Amazon and I'm using preg_match to find the elements that I need. However, I'm running into issues.

I combine two patterns with "|" so that if it doesn't find one, it looks for the other. I believe one of the two will always exist unless the product is not listed.

So what it's doing is looking for the shipping cost. If that's not there, it looks for the "FREE Shipping" text.

preg_match_all('/(& <b>(.*?)<|<span class="olpShippingPrice">(.*?)<)/',$results,$match1);

If I run this I get the data I want, but it's also grabbing some HTML that would NOT be grabbed if I ran this as two separate preg_match calls. I cannot figure out how to show it here, but it's grabbing the bold tag on the first 'FREE Shipping', so all text below that is bold. You can see the trailing "<" characters as well.

  [1]=>
   array(10) {
     [0]=>
     string(38) "$30.00<"
     [1]=>
     string(37) "$6.99<"
     [2]=>
     string(37) "$6.99<"
     [3]=>
     string(38) "$53.99<"
     [4]=>
     string(37) "$5.25<"
     [5]=>
     string(19) "& FREE Shipping<"
     [6]=>
     string(19) "& FREE Shipping<"
     [7]=>
     string(19) "& FREE Shipping<"
     [8]=>
     string(19) "& FREE Shipping<"
     [9]=>
     string(38) "$70.39<"
   }

So my question: what must I do to remove the tags and the "<" characters from this so that I am left with clean data? Also, running these as two separate preg_match calls doesn't work for me.

@smack-a-bro Because parsing HTML with regex is bad and the answer on the linked question is a warning to be heeded. ***Especially*** when you don't control the source HTML. — Niet the Dark Absol, Oct 31 '14 at 15:08
@NiettheDarkAbsol I believe it's more that you're being pointed to "do not parse html with regex", and the humorous answer — will, Oct 31 '14 at 15:15
Close voters. Please **do not vote to close as a duplicate of the famous question**. It's not constructive and not helpful. — Madara's Ghost, Oct 31 '14 at 16:39

score 1 · Accepted Answer · answered Oct 31 '14 at 15:10

Without seeing your sample text, it's hard to know exactly what you need. But the main thing you need to do is take those "unwanted" characters out of the capture group; then use the capture group as your clean data:

preg_match_all('/(?:& <b>|<span class="olpShippingPrice">)(.*?)</',$results,$match1);
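To illustrate what the non-capturing group changes, here is a minimal sketch run against made-up markup. The contents of `$results` below are an assumption, since the actual Amazon HTML was not posted; only the pattern comes from the answer above.

```php
<?php
// Hypothetical sample markup; the real page source was not shown in the question.
$results = '<span class="olpShippingPrice">$6.99</span> '
         . '& <b>FREE Shipping</b> '
         . '<span class="olpShippingPrice">$30.00</span>';

// (?: ... ) groups the two alternative prefixes without capturing them,
// and the trailing "<" sits outside the parentheses, so group 1 holds
// only the text between the prefix and the next "<".
preg_match_all(
    '/(?:& <b>|<span class="olpShippingPrice">)(.*?)</',
    $results,
    $match1
);

// $match1[0] still contains the full matches (prefix and "<" included);
// $match1[1] contains the clean values.
print_r($match1[1]);
```

With this sample input, `$match1[1]` should contain `$6.99`, `FREE Shipping`, and `$30.00`, in that order. In the original pattern the outer parentheses wrapped the whole alternation, which is why group 1 included the tag and the closing "<".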

Brian Stephens

This is it. Thanks a ton. Since I've read that I should not parse HTML with regex, what should be used instead? – smack-a-bro Oct 31 '14 at 15:13
@smack-a-bro http://php.net/domdocument is my preferred method. – Niet the Dark Absol Oct 31 '14 at 15:18
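For readers wondering what the DOMDocument route might look like, here is a rough sketch using DOMDocument together with DOMXPath. The sample markup, the assumption that the class attribute is exactly `olpShippingPrice`, and the assumption that the free-shipping text sits alone inside a `<b>` element are all guesses based on the regex in the question, not on the real page.

```php
<?php
// Hypothetical sample markup; the real page structure is an assumption.
$html = '<div><span class="olpShippingPrice">$6.99</span></div>'
      . '<div>&amp; <b>FREE Shipping</b></div>';

$doc = new DOMDocument();
// Scraped HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc->loadHTML($html);
libxml_clear_errors();

$xpath = new DOMXPath($doc);
// Select either a shipping-price span or a <b> element whose text is "FREE Shipping".
$nodes = $xpath->query(
    '//span[@class="olpShippingPrice"] | //b[normalize-space()="FREE Shipping"]'
);

$shipping = [];
foreach ($nodes as $node) {
    // textContent returns the element's text without any tags.
    $shipping[] = trim($node->textContent);
}

print_r($shipping);
```

Because the values come from `textContent`, there are no tags or stray "<" characters to strip afterwards. The XPath expression would need adjusting to whatever the real markup turns out to be.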
