Scraping (Regex) Issues

Question

I've been trying to build a simple scraper that would take a keyword, then go to Amazon and enter the keyword into the search box, then scrape the main results only.

The problem is that the Regex isn't working. I've tried many different ways, but it's still not working properly.

$url = "http://www.amazon.com/s/ref=nb_sb_noss?url=search-alias%3Daps&field-keywords=dog+bed&x=0&y=0";

$ch = curl_init($url);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
$return = curl_exec($ch);
curl_close($ch);

preg_match_all('(<div.*class="data">.*<div class="title">.*<a.*class="title".*href="(.*?)">(.*?)</a>)', $return, $matches);

var_dump($matches);

Now Amazon's HTML code looks like this:

<div class="title">
<a class="title" href="https://rads.stackoverflow.com/amzn/click/com/B00063KG7S" rel="nofollow noreferrer">Midwest 40236 36-By-23-Inch Quiet Time Bolster Pet Bed, Fleece</a>
        <span class="ptBrand">by Midwest Homes for Pets</span>
 <span class="bindingAndRelease">(Nov 30, 2006)</span>
        </div>

I've tried to change the Regex a million different ways, but what I've learned over the past few months just isn't working, at all. Of course, if I just change it to href="(.*?)" - I get every link on there...but not when I add in the

Any advice would be appreciated!

score 1 · Answer 1 · edited Jan 18 '21 at 12:30

It might be important to note that questions like this that request help scraping copyright protected content are in violation of SO's Terms of Use, specifically the section on Subscriber Content which states:

"Subscriber represents, warrants and agrees that it will not contribute any Subscriber Content that (a) infringes, violates or otherwise interferes with any copyright or trademark of another party"

See https://meta.stackexchange.com/questions/93698/web-scraping-intellectual-property-and-the-ethics-of-answering/93701#93701 for an ongoing discussion of this issue.

answered Jun 03 '11 at 18:52

Rob Raisch

Guys, I'm just trying to learn PHP by scraping. I'm not trying to take down Amazon or anything. – 714sooner Jun 03 '11 at 18:57
I think the TOS belongs to the content you post. Telling how a regex can be written or telling how to use domdocument and xpath does not violate any copyright at all. – hakre Jun 03 '11 at 19:14
And if you know beforehand the advice you provide will be used to violate another's intellectual property rights, doesn't that fall under the TOS? (This would be a good addition to the discussion on Meta I reference above. Please consider adding it there.) – Rob Raisch Jun 03 '11 at 19:23
Rob, it sounds like you're assuming I'm going to do something bad here. Violating someone's intellectual property rights because I want to learn how to further my Regex knowledge by "scraping" content on my own machine? Not sure how well that would hold up in court... – 714sooner Jun 03 '11 at 21:55
Rob, no idea how high you place copyright, but it sounds like you have reminded your government about killing persons illegally in other countries, haven't you? Just asking. – hakre Jun 03 '11 at 22:10

score 1 · Accepted Answer · answered Jun 03 '11 at 19:27

Parsing complex structures with a regex often fails. The regex gets complicated, and even if you put a lot of effort in, it never works properly. That is due to the nature of the data you want to analyse and the limitations of regexes.

When websites weren't that complex, I did the following, which often works well as a quick solution:

Find a string that marks the beginning of the interesting part and cut out everything before it. Then find a string that marks the end and cut out everything after it.

and then parse :)
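A minimal sketch of that cut-before/cut-after idea, using only strpos() and substr(). The helper name cutBetween, the sample markup, and the example.com URL are made up for illustration; the marker strings would have to be adapted to the real page.

```php
<?php
// Return the part of $haystack between $startMarker and $endMarker,
// or null if either marker is missing. Both markers are excluded.
function cutBetween(string $haystack, string $startMarker, string $endMarker): ?string
{
    $start = strpos($haystack, $startMarker);
    if ($start === false) {
        return null;
    }
    $start += strlen($startMarker);

    // Look for the end marker only after the start marker.
    $end = strpos($haystack, $endMarker, $start);
    if ($end === false) {
        return null;
    }

    return substr($haystack, $start, $end - $start);
}

// Sample markup modelled on the snippet in the question (shortened, placeholder URL).
$html = '<div class="data"><div class="title">'
      . '<a class="title" href="https://example.com/item">Quiet Time Bolster Pet Bed</a>'
      . '<span class="ptBrand">by Midwest Homes for Pets</span></div></div>';

// First narrow down to the interesting block, then cut the pieces out of it.
$block = cutBetween($html, '<div class="title">', '</div>');
$href  = $block === null ? null : cutBetween($block, 'href="', '"');
$title = $block === null ? null : cutBetween($block, '">', '</a>');

var_dump($href, $title);
```

This only holds up while the markers stay stable and the block contains no nested tag of the same kind, which is why it is a quick fix and not a general parser.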

Nowadays, if you need something flexible, write yourself a cache layer so that you automatically keep a copy of the resources you need to scrape. That way you can code your scraper without the overhead of requesting the external data all over again while you develop the right scraping strategy (the data does not change that fast).
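One way such a cache layer might look, as a rough sketch: pages are stored on disk under a hash of the URL and reused until they are older than a chosen age. The function name cachedFetch, the $maxAge default, and the stand-in fetcher are assumptions for this example; in a real scraper the callable would wrap the cURL code from the question.

```php
<?php
// Fetch $url through a simple file cache. $fetch is any callable that takes a
// URL and returns the body as a string (for example a cURL wrapper).
function cachedFetch(string $url, string $cacheDir, callable $fetch, int $maxAge = 86400): string
{
    if (!is_dir($cacheDir)) {
        mkdir($cacheDir, 0777, true);
    }
    $file = $cacheDir . '/' . sha1($url) . '.html';

    // Serve the stored copy while it is fresh enough.
    if (is_file($file) && time() - filemtime($file) < $maxAge) {
        $cached = file_get_contents($file);
        if ($cached !== false) {
            return $cached;
        }
    }

    // Otherwise fetch the page and store it for the next run.
    $body = $fetch($url);
    file_put_contents($file, $body);
    return $body;
}

// Demo with a stand-in fetcher so the sketch runs without network access.
$calls = 0;
$fakeFetch = function (string $url) use (&$calls): string {
    $calls++;
    return "<html><body>page for $url</body></html>";
};

$dir    = sys_get_temp_dir() . '/scrape-cache-' . uniqid();
$first  = cachedFetch('http://example.com/search?q=dog+bed', $dir, $fakeFetch);
$second = cachedFetch('http://example.com/search?q=dog+bed', $dir, $fakeFetch);

echo "fetcher was called $calls time(s)\n";
```

The second call is answered from disk, so you can rerun and tweak the parsing code as often as you like without hitting the remote site again.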

Then convert the HTML into XML, for example with DomDocument in PHP. That works very well once you've done it two or three times. You might run into encoding problems and syntax problems, but those can be solved, and things have gotten much better compared to some years ago.

Then you could step into XPath, which is quite flexible for running expressions on the XML.
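For the markup shown in the question, the DomDocument and XPath steps could be sketched like this. The sample HTML is a shortened copy with a placeholder URL, and the XPath expression assumes the class attributes are exactly "title", so treat it as a starting point and not as something tested against the live page.

```php
<?php
// Sample markup modelled on the snippet in the question (shortened, placeholder URL).
$html = <<<HTML
<div class="data">
  <div class="title">
    <a class="title" href="https://example.com/item">Quiet Time Bolster Pet Bed, Fleece</a>
    <span class="ptBrand">by Midwest Homes for Pets</span>
  </div>
</div>
HTML;

// Real-world HTML is rarely valid, so collect parser warnings instead of printing them.
libxml_use_internal_errors(true);
$doc = new DOMDocument();
$doc->loadHTML($html);
libxml_clear_errors();

// Select every <a class="title"> that sits directly inside a <div class="title">.
$xpath   = new DOMXPath($doc);
$results = [];
foreach ($xpath->query('//div[@class="title"]/a[@class="title"]') as $link) {
    $results[] = [
        'href'  => $link->getAttribute('href'),
        'title' => trim($link->textContent),
    ];
}

print_r($results);
```

Compared with the regex in the question, the expression does not care about whitespace, attribute order, or what else sits between the tags.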

But next to that there is a PHP lib that really is super-cool: FluentDOM.

It combines the best of DomDocument, XPath and PHP and is quite flexible.

Some examples & resources by the author of FluentDOM I can suggest:

Just found this answer, have fun: http://stackoverflow.com/questions/1732348/regex-match-open-tags-except-xhtml-self-contained-tags/1732454#1732454 — hakre, Jun 06 '11 at 22:58

score 0 · Answer 3 · edited May 23 '17 at 12:04

You should probably use an XML parser + XPath instead of a regexp to do that… XML + RE = bad idea.

Plus, isn't what you intend to do against Amazon's Terms of Use?

answered Jun 03 '11 at 18:53

Thomas Hupkens

score 0 · Answer 4 · answered Jun 03 '11 at 18:56

I've not done this in PHP, but I've done similar things in Python. I suspect the correct approach is to use an HTML DOM parser like http://simplehtmldom.sourceforge.net/, which parses the HTML and turns it into objects for you to use.

dsolimano
