7

I am trying to scrape ratings from trustpilot.com.

Is it possible to extract a class name using Scrapy? The rating I am trying to scrape is made up of five individual images, but the images sit inside a div whose class names the rating. For example, if the rating is 2 stars, then:

<div class="star-rating count-2 size-medium clearfix">...

If it is 3 stars, then:

<div class="star-rating count-3 size-medium clearfix">...

So is there a way I can scrape the class count-2 or count-3 assuming a selector like .css('.star-rating')?

chancyWu
Dan
  • You could combine it with an xpath like `response.css('.star-rating').xpath("@class").extract()` (not tested). – Jan Feb 08 '18 at 18:32
  • Thanks, that returns `['star-rating count-4 size-medium clearfix']` which is close enough to get something working. But do you know if I can use xpath to only get the classes starting with `count-`? – Dan Feb 08 '18 at 18:36
  • You could try: `response.css('.star-rating').xpath(".//[contains(@class, 'count-')]/@class").extract()` – Jan Feb 08 '18 at 18:38
  • That errored, but this sort of hack works `response.css('.star-rating').xpath('./@class').extract()[0].split(' ')[1][-1]` – Dan Feb 08 '18 at 18:40
  • Otherwise please give a demo link. – Jan Feb 08 '18 at 18:45
  • Dan, I'm fairly certain that XPath 1.0 only operates on nodes in the DOM. Scrapy uses lxml, which only implements XPath 1.0. XPath 2.0 has some nifty functions like matches, tokenize, and replace that you could use to directly get what you want. Otherwise Jan's answer is the best you will get. – RabidCicada Feb 08 '18 at 20:02

3 Answers

7

You could use a combination of both somewhere in your code:

import re

classes = response.css('.star-rating').xpath("@class").extract()
for cls in classes:
    match = re.search(r'\bcount-\d+\b', cls)
    if match:
        print("Class = {}".format(match.group(0)))
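
If the rating is needed as a number rather than as a class name, the same regular expression can capture just the digits. A minimal sketch, assuming the class attribute has already been extracted as a string; `rating_from_class` is a hypothetical helper name, not part of Scrapy:

```python
import re

# Hypothetical helper (not part of Scrapy): pull the numeric rating out of a
# class attribute such as "star-rating count-4 size-medium clearfix".
COUNT_RE = re.compile(r'\bcount-(\d+)\b')


def rating_from_class(class_attr):
    """Return the rating as an int, or None if no count-N class is present."""
    match = COUNT_RE.search(class_attr)
    return int(match.group(1)) if match else None


print(rating_from_class('star-rating count-4 size-medium clearfix'))  # prints 4
```

Returning None instead of raising keeps a spider running when a review has no star widget at all.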
Jan
  • Thanks, ended up combining the two answers to `response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]` – Dan Feb 09 '18 at 11:21
4

You can extract the rating directly using re() (or re_first() if you only need the first match):

for rating in response.xpath('//div[contains(@class, "star-rating")]/@class').re(r'count-(\d+)'):
    print(rating)
gangabass
-2

I had a similar question. Using Scrapy v1.5.1, I could extract attributes of elements by name. Here is an example used on Lowes; I did the same with the class attribute:

    # Each `product` is an <li>; ::attr(name) reads one attribute by name.
    for product in response.css('ul.product-cards-grid li.product-wrapper'):
        prod_href = product.css('li::attr(data-producturl)').extract()
        prod_name = product.css('li::attr(data-producttitle)').extract_first()
        prod_img  = product.css('li::attr(data-productimg)').extract_first()
        prod_id   = product.css('li::attr(data-productid)').extract_first()