Regex to parse Amazon snippet HTML tag

Question

I have these two snippets:

<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21">PFIFF Reitstrumpf kariert, grau/lila, 37-39, 100322-144-37</a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />

Second one:

<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />

(Note that they're similar, but the second one is slightly longer.)

From the first snippet I need the content of the href attribute; from the second I need the content of the image's src attribute.

This does not work:

$result = preg_match_all("/<img.*?src\s*=.*?>/",$_POST['bild'],$matches);

What should I do?

Maybe check out [this](http://stackoverflow.com/questions/3577641/how-do-you-parse-and-process-html-xml-in-php). — Lux, May 16 '16 at 16:45
The _snippets_ are identical, can you clarify what it is you're looking for? We don't usually play guessing games on SO. — , May 16 '16 at 16:51
@sln They are not identical. I just ran WinMerge on them and they are quite different. — Jorge Campos, May 16 '16 at 17:11
@JorgeCampos I see, there's some random text in the first one halfway through. The scroll bars are different lengths. (duh, how'd I not see that?) — Laurel, May 16 '16 at 17:13
@JorgeCampos - Well, are we supposed to guess if that makes a difference ? — , May 16 '16 at 17:15
@sln I always analyze the content of a question before commenting. If the OP says he has two snippets, they must be different; otherwise there would be no problem at all, and it is not up to us to assume. When trying to find a solution we would probably spot the problem (if the snippets were identical, as you said) and only then comment. That is what I always try to do. I can't speak for others. — Jorge Campos, May 16 '16 at 17:23
@JorgeCampos - If someone shows two strings he wants to be two different values where each string has the same components of what they want, it doesn't resolve to something that is clear when there is nothing in the post language, nor examples that relate in any way to a difference. I see people guess all the time with _If this is what you mean_, etc. I stopped doing that. That's what I always try to do. — , May 16 '16 at 17:28
@JorgeCampos I think that a lot of people didn't realize that (me included) when writing and testing their code. Fortunately for me, my answer does not change because of the slight difference because I am trying to consider different use cases already. (I have clarified the question, too now.) — Laurel, May 16 '16 at 17:30
@Laurel Yeah, I saw it in your answer and even gave it a +1. — Jorge Campos, May 16 '16 at 17:31
Both samples contain `href` and `img`. The second one contains 2 _img_ tags, what if any is the difference, and how should that affect the regex? — , May 16 '16 at 17:36
I'm taking a guess of course, but this might work `(?s)(?:(?:(?<=\s)href\s*=\s*(['"])(.*?)\1|".*?"|'.*?'|[^>]*?)+>)(?<!/>)(?(2)|(?!)).*?|(?:(?<=\s)src\s*=\s*(['"])(.*?)\3|".*?"|'.*?'|[^>]*?)+>)(?(4)|(?!)))` but depending on your meaning, I can only guess. — , May 16 '16 at 17:58

Bijan · Answer 1 · 2016-05-16T17:01:34.637

Instead of using a regex, you can use Simple HTML DOM to parse the HTML.

include 'simple_html_dom.php';

$html = str_get_html('<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />');
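// href of the first <a> element
echo $html->find('a', 0)->href . PHP_EOL;
// src of the first <img> element (the one nested inside the link)
echo $html->find('img', 0)->src;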
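
If adding a third-party library is not an option, a similar lookup can be sketched with PHP's bundled DOM extension. This is only a minimal sketch and not part of the original answer: it assumes the dom extension is enabled, the helper name firstLinkAndImage is hypothetical, and libxml's warnings about the unescaped & characters in the URLs are suppressed rather than fixed.

```php
<?php
// Hypothetical helper (not from the original answer): returns the href of the
// first <a> and the src of the first <img>, or null where none is found.
function firstLinkAndImage(string $html): array
{
    // The snippets contain raw "&" in URLs, which makes libxml report errors;
    // collect them internally instead of printing warnings.
    $previous = libxml_use_internal_errors(true);

    $doc = new DOMDocument();
    $doc->loadHTML($html);

    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // Elements are returned in document order, so item(0) is the first one.
    $a   = $doc->getElementsByTagName('a')->item(0);
    $img = $doc->getElementsByTagName('img')->item(0);

    return [
        'href' => $a instanceof DOMElement ? $a->getAttribute('href') : null,
        'src'  => $img instanceof DOMElement ? $img->getAttribute('src') : null,
    ];
}

// Second snippet from the question.
$snippet = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';

$result = firstLinkAndImage($snippet);
echo $result['href'], PHP_EOL;
echo $result['src'], PHP_EOL;
```

As with Simple HTML DOM, index 0 picks the image nested inside the link, not the 1x1 tracking pixel that follows it.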

Laurel · Answer 2 · 2016-05-16T17:33:10.660

This one extracts the href (~36 steps):

<a(?:\s*(?!href)[^\s>]*)*\s*href=["']([^"']+)

This one extracts the src (~59 steps):

<img(?:\s*(?!src)[^\s>]*)*\s*src=["']([^"']+)

Tags are regular, and can be parsed by regexes fairly easily. Note that I am assuming that the attributes (href and src) are surrounded by quotes of either variety.

These regexes are pretty fast: more than 10x faster than the other regex answers. In fact, given all the optimizations in PCRE, they may even be faster than a full parser.

Essentially, my two regexes are almost identical. Each finds the start of the tag (<a or <img) and checks whether any attributes follow it. Attributes that are not the one you want are skipped by (?:\s*(?!href)[^\s>]*)*. The one you want is captured by \s*href=["']([^"']+).
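
As a rough illustration, the two patterns can be dropped into preg_match like this. It is a sketch under a few assumptions: the ~ delimiter, the constant names, and the helper firstCapture are my own choices for the example and are not part of the patterns themselves.

```php
<?php
// The two patterns from this answer, wrapped in "~" delimiters. Inside a
// single-quoted PHP string only the quote character needs escaping.
const HREF_PATTERN = '~<a(?:\s*(?!href)[^\s>]*)*\s*href=["\']([^"\']+)~';
const SRC_PATTERN  = '~<img(?:\s*(?!src)[^\s>]*)*\s*src=["\']([^"\']+)~';

// Hypothetical helper: returns capture group 1 of the first match, or null.
function firstCapture(string $pattern, string $html): ?string
{
    return preg_match($pattern, $html, $m) === 1 ? $m[1] : null;
}

// The two snippets from the question.
$first = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21">PFIFF Reitstrumpf kariert, grau/lila, 37-39, 100322-144-37</a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';
$second = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';

$href = firstCapture(HREF_PATTERN, $first);   // link target of the first snippet
$src  = firstCapture(SRC_PATTERN, $second);   // first image source of the second snippet

echo $href, PHP_EOL;
echo $src, PHP_EOL;
```

Because preg_match stops at the first match, the src pattern returns the image nested inside the link in the second snippet. Applied to the first snippet it would return the tracking pixel instead, since that is the only img tag there.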

edited May 16 '16 at 17:33

answered May 16 '16 at 16:55

I removed the RegEx from my answer and replaced it with an HTML Parser. – Bijan May 16 '16 at 17:01
For 98% of use cases, the optimization won't matter and the simpler `.*?` is preferred (not knocking this faster regex, just saying). – Scott Weaver May 16 '16 at 17:02
@sweaver2112 It depends how much is being parsed. If you are parsing hundreds of webpages, then having a faster regex is worth it. – Laurel May 16 '16 at 17:05

Scott Weaver · Answer 3 · 2016-05-16T17:19:35.700

You can parse these values with a pretty simple regex, using the non-greedy "dot" (.*?). Although the dot matches anything, the non-greedy quantifier consumes only one character at a time and then lets the rest of the pattern (the closing double-quote delimiter) try to match. You can add named groups for readability and easier access to the results:

href="(?<href>.*?)"|src="(?<imgsrc>.*?)" //global

As Laurel has noted, this reduction in complexity comes at the cost of execution speed. The trade-off depends on your use case.
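
In PHP, the "global" behaviour corresponds to preg_match_all. The following is a minimal sketch of one way to use the pattern: the helper name extractLinksAndSources is hypothetical, and filtering out empty strings is my own assumption about how to handle the alternation, since in the default pattern-order result each named group gets an empty string for matches where the other branch matched.

```php
<?php
// Hypothetical helper: collects every href and every src value in $html using
// the named-group pattern from this answer.
function extractLinksAndSources(string $html): array
{
    $pattern = '/href="(?<href>.*?)"|src="(?<imgsrc>.*?)"/';
    preg_match_all($pattern, $html, $m); // default flag: PREG_PATTERN_ORDER

    // Drop the empty placeholders left by the branch that did not match.
    $nonEmpty = static function (string $value): bool {
        return $value !== '';
    };

    return [
        'href'   => array_values(array_filter($m['href'], $nonEmpty)),
        'imgsrc' => array_values(array_filter($m['imgsrc'], $nonEmpty)),
    ];
}

// Second snippet from the question.
$snippet = '<a rel="nofollow" href="http://www.amazon.de/gp/product/B004DI7A5S/ref=as_li_tl?ie=UTF8&camp=1638&creative=6742&creativeASIN=B004DI7A5S&linkCode=as2&tag=webbigode-21"><img border="0" src="http://ws-eu.amazon-adsystem.com/widgets/q?_encoding=UTF8&ASIN=B004DI7A5S&Format=_SL110_&ID=AsinImage&MarketPlace=DE&ServiceVersion=20070822&WS=1&tag=webbigode-21" ></a><img src="http://ir-de.amazon-adsystem.com/e/ir?t=webbigode-21&l=as2&o=3&a=B004DI7A5S" width="1" height="1" border="0" alt="" style="border:none !important; margin:0px !important;" />';

$found = extractLinksAndSources($snippet);
print_r($found);
```

Unlike the tag-anchored patterns, this one collects every src in the input, so the tracking pixel appears in the imgsrc list alongside the product image and the caller has to pick the entry it wants.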

edited May 16 '16 at 17:19

answered May 16 '16 at 17:12

The snippets are different, FYI. – Laurel May 16 '16 at 17:16
