
I have written the following spider for scraping patient reviews from the WebMD site:

from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector


class MySpider(BaseSpider):
    name = "webmd"
    allowed_domains = ["webmd.com"]
    start_urls = ["http://www.webmd.com/drugs/drugreview-92884-Boniva"]

    def parse(self, response):
        hxs = HtmlXPathSelector(response)
        titles = hxs.select("//p")
        title = titles.select("//p[contains(@class, 'comment') and contains(@style, 'display:none')]/text()").extract()
        print(title)

Executing this code gives me the desired output, but with a lot of duplication, i.e. the same comments are repeated at least 10 times. Help me solve this issue.

2 Answers


You can rewrite your spider code like this:

import scrapy

# Your Items 
class ReviewItem(scrapy.Item):
    review = scrapy.Field()


class WebmdSpider(scrapy.Spider):
    name = "webmd"
    allowed_domains = ["webmd.com"]
    start_urls = ['http://www.webmd.com/drugs/drugreview-92884-Boniva']

    def parse(self, response):
        titles = response.xpath('//p[contains(@id, "Full")]')
        for title in titles:
            item = ReviewItem()
            item['review'] = title.xpath('text()').extract_first()
            yield item

        # Check if there is a next page link, and keep parsing if there is
        next_page = response.xpath('(//a[contains(., "Next")])[1]/@href').extract_first()
        if next_page:
            yield scrapy.Request(response.urljoin(next_page), callback=self.parse)

It selects only the full customer reviews, without duplicates, and saves them in Scrapy Items. Note: instead of HtmlXPathSelector you can use the more convenient response.xpath shortcut. Also, I changed the deprecated scrapy.BaseSpider to scrapy.Spider.

To save the reviews in CSV format, you can simply use Scrapy Feed exports and type scrapy crawl webmd -o reviews.csv in the console.
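If you'd rather configure the export in the spider itself than on the command line, here is a minimal sketch using per-spider custom_settings; FEED_FORMAT and FEED_URI are the standard feed-export settings in Scrapy versions of this era:

import scrapy

class WebmdSpider(scrapy.Spider):
    name = "webmd"
    # Equivalent to passing "-o reviews.csv" on the command line;
    # custom_settings override the project-wide settings for this spider.
    custom_settings = {
        'FEED_FORMAT': 'csv',
        'FEED_URI': 'reviews.csv',
    }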

vold
  • Glad I can help. Also, you'd better change the deprecated `scrapy.BaseSpider` to `scrapy.Spider`. And for saving scraped reviews you can use [Scrapy Items](https://doc.scrapy.org/en/latest/topics/items.html). – vold Apr 15 '17 at 07:54
  • Can you help me with saving the reviews in a .csv file? Each comment in a different cell. – Swapnil Joshi Apr 15 '17 at 09:02
  • Is there any way to limit the number of reviews to be scraped? – Swapnil Joshi Apr 15 '17 at 13:57
  • The page contains 153 reviews and shows 5 reviews per page. How many of the 153 do you need? – vold Apr 15 '17 at 14:03
  • I want just 100 reviews in total – Swapnil Joshi Apr 15 '17 at 14:12
  • You can see this [answer](http://stackoverflow.com/questions/35748061/how-to-stop-scrapy-spider-after-certain-number-of-requests) from a similar question. – vold Apr 15 '17 at 14:30
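Since the last few comments discuss capping the crawl at 100 reviews, here is a minimal sketch using Scrapy's built-in CloseSpider extension (CLOSESPIDER_ITEMCOUNT is a standard setting; the spider may overshoot slightly because in-flight requests still finish):

import scrapy

class WebmdSpider(scrapy.Spider):
    name = "webmd"
    # Close the spider once (roughly) 100 items have been scraped.
    custom_settings = {'CLOSESPIDER_ITEMCOUNT': 100}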

You can use sets to get unique comments. Note that the selector returns its results as a list, so if you pass that list to a set you'll get only unique results. So:

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    titles = hxs.select("//p")
    # Wrapping the extracted list in a set drops the duplicates.
    title = set(titles.select("//p[contains(@class, 'comment') and contains(@style, 'display:none')]/text()").extract())
    print(title)  # this will have only unique results
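
One caveat worth adding: a Python set is unordered, so the comments will come out in arbitrary order. If you want to drop duplicates while keeping the page order, here is a small order-preserving sketch (plain Python, same selector as above):

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    seen = set()
    titles = []
    for text in hxs.select("//p[contains(@class, 'comment') and contains(@style, 'display:none')]/text()").extract():
        if text not in seen:  # keep only the first occurrence
            seen.add(text)
            titles.append(text)
    print(titles)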
Mani