Scraping reviews and details from HTML source

Question

I was looking at a page of reviews and tried to scrape the reviews from it (though the site provides an API for the same data).

I saw that each review is embedded inside an li tag, which also contains many other tags.

Inside it, there is a div with the class name review-wrapper, which contains the review text along with its rating.

Is it possible to write a script that finds all such containers and scrapes the review, image (if one exists), rating, and date from each?

Is regex the correct way to do this, or is the DOM more suitable?

Page link: http://www.yelp.com/biz/franchino-san-francisco?start=80

Here is the code snippet:

    <div class="review-wrapper">
        <div class="review-content">
            <div class="biz-rating biz-rating-very-large clearfix">
                <div itemprop="reviewRating" itemscope itemtype="http://schema.org/Rating">
                    <div class="rating-very-large">
                        <i class="star-img stars_5" title="5.0 star rating">
                            <img alt="5.0 star rating" class="offscreen" height="303" src="http://s3-media3.ak.yelpcdn.com/assets/2/www/img/c2252a4cd43e/ico/stars/v2/stars_map.png" width="84">
                        </i>
                        <meta itemprop="ratingValue" content="5.0">
                    </div>
                </div>
                <span class="rating-qualifier">
                    <meta itemprop="datePublished" content="2013-10-28">
                    10/28/2013
                </span>
            </div>
            <p class="review_comment ieSucks" itemprop="description" lang="en">The reason I started a yelp account, was to write a review for Franchinos. This is my favorite restaurant in the city of San Francisco, and especially, North Beach. <br><br>Where do I start... I take every friend, family member and acquaintance to Franchinos in every opportunity I can. I am a Italy-nut and have been over three times - the mood + atmosphere is almost identical. It is a 100% family-run restaurant and you can taste the expertise and &#39;home-cooking&#39;. <br><br>Each time, I get a large bottle of wine (One time - they ran out of the wine I had ordered - and instead gave me a larger, more expensive bottle - same price), a wonderful pasta dish (Alfredo, carbonara.. etc.) and a Caesar salad.<br><br>Need I say more? Buenisimo. I look forward to the next time.. and the times after that again and again. <br><br>è perfetto!</p>
        </div>
        <div class="review-footer clearfix">
            <div class="rateReview ufc-feedback clearfix" data-review-id="SnZ4Q97nJdR7a-fot-Slcw">
                <p class="review-intro review-message">
                    Was this review &hellip;?
                </p>

If possible, always go with the `DOM` instead of a `regex`; if the HTML changes even a bit, the regex will fail. — Pedro Lobito, Apr 27 '14 at 18:39
I write regexes to scrape images, links, etc. from source code, and I don't know much about the DOM. Can the DOM loop through all such divs and collect data from each review? — user123, Apr 27 '14 at 18:41
Yes, see this answer: http://stackoverflow.com/questions/23305739/getting-the-href-attribute-and-text-of-certain-kind-of-links?answertab=active#tab-top — Pedro Lobito, Apr 27 '14 at 18:58
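
To illustrate the parser-based approach suggested in the comments, here is a minimal sketch in Python. It uses the standard library's event-based `html.parser` rather than a full DOM tree, but the idea is the same: walk the markup with an HTML parser instead of a regex, and loop over every `review-wrapper` div. The names `ReviewParser`, `extract_reviews` and `SAMPLE_HTML` are hypothetical, the sample markup is made up and only modeled on the snippet in the question, and skipping images with the class `offscreen` (the star sprite) is an assumption based on that snippet.

```python
from html.parser import HTMLParser


class ReviewParser(HTMLParser):
    """Collect rating, date, text and image URLs from each review-wrapper div."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.reviews = []       # one dict per review-wrapper div
        self._current = None    # review being filled, or None outside a wrapper
        self._div_depth = 0     # <div> nesting depth inside the current wrapper
        self._in_text = False   # True while inside the description <p>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if self._current is None:
            # Outside a review: only react to the container div.
            if tag == "div" and "review-wrapper" in classes:
                self._current = {"rating": None, "date": None,
                                 "text": "", "images": []}
                self._div_depth = 1
            return
        if tag == "div":
            self._div_depth += 1
        elif tag == "meta":
            if attrs.get("itemprop") == "ratingValue":
                self._current["rating"] = attrs.get("content")
            elif attrs.get("itemprop") == "datePublished":
                self._current["date"] = attrs.get("content")
        elif tag == "img":
            # Assumption: "offscreen" marks the star sprite, not a review photo.
            if attrs.get("src") and "offscreen" not in classes:
                self._current["images"].append(attrs["src"])
        elif tag == "p" and attrs.get("itemprop") == "description":
            self._in_text = True
        elif tag == "br" and self._in_text:
            self._current["text"] += "\n"

    def handle_endtag(self, tag):
        if self._current is None:
            return
        if tag == "p":
            self._in_text = False
        elif tag == "div":
            self._div_depth -= 1
            if self._div_depth == 0:   # the wrapper div itself just closed
                self._finish()

    def handle_data(self, data):
        if self._current is not None and self._in_text:
            self._current["text"] += data

    def close(self):
        super().close()
        if self._current is not None:  # input ended inside a wrapper
            self._finish()

    def _finish(self):
        self._current["text"] = self._current["text"].strip()
        self.reviews.append(self._current)
        self._current = None
        self._in_text = False


def extract_reviews(html):
    """Return a list of review dicts found in the given HTML string."""
    parser = ReviewParser()
    parser.feed(html)
    parser.close()
    return parser.reviews


# Made-up sample, modeled on the structure of the snippet in the question.
SAMPLE_HTML = """
<ul>
  <li><div class="review-wrapper">
    <div class="review-content">
      <i class="star-img stars_5"><img class="offscreen" src="stars_map.png"></i>
      <meta itemprop="ratingValue" content="5.0">
      <meta itemprop="datePublished" content="2013-10-28">
      <p itemprop="description">Great pasta.<br><br>&#39;Home-cooking&#39; at its best.</p>
    </div>
  </div></li>
  <li><div class="review-wrapper">
    <div class="review-content">
      <meta itemprop="ratingValue" content="3.0">
      <meta itemprop="datePublished" content="2013-09-02">
      <p itemprop="description">Decent, but slow service.</p>
      <img src="photo.jpg">
    </div>
  </div></li>
</ul>
"""

if __name__ == "__main__":
    for review in extract_reviews(SAMPLE_HTML):
        print(review["date"], review["rating"], review["images"])
        print(review["text"])
```

Because the parser keys on the `itemprop` attributes and the `review-wrapper` class rather than on exact text layout, changes in whitespace or attribute order should not break it the way they would break a regex. If the site renames those classes or attributes, the sketch would need to be adjusted accordingly.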
