Extract class name in scrapy

Question

I am trying to scrape ratings from trustpilot.com.

Is it possible to extract a class name using scrapy? I am trying to scrape a rating that is made up of five individual images, but the images sit inside an element whose class name encodes the rating. For example, if the rating is 2 stars, then:

<div class="star-rating count-2 size-medium clearfix">...

If it is 3 stars, then:

<div class="star-rating count-3 size-medium clearfix">...

So is there a way I can scrape the class count-2 or count-3 assuming a selector like .css('.star-rating')?

You could combine it with an xpath like `response.css('.star-rating').xpath("@class").extract()` (not tested). — Jan, Feb 08 '18 at 18:32
Thanks, that returns `['star-rating count-4 size-medium clearfix']` which is close enough to get something working. But do you know if I can use xpath to only get the classes starting with `count-`? — Dan, Feb 08 '18 at 18:36
You could try: `response.css('.star-rating').xpath(".//[contains(@class, 'count-')]/@class").extract()` — Jan, Feb 08 '18 at 18:38
That errored, but this sort of hack works `response.css('.star-rating').xpath('./@class').extract()[0].split(' ')[1][-1]` — Dan, Feb 08 '18 at 18:40
Dan I'm fairly certain that xpath1 only operates on nodes in the dom. scrapy uses lxml which only implements xpath1. xpath2 has some nifty functions like matches, tokenize, and replace that you could use to directly get what you want. Otherwise Jan's answer is the best you will get — RabidCicada, Feb 08 '18 at 20:02
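
The index-based hack in the comments above depends on `count-N` being the second class and on the rating being a single digit. A minimal sketch of a position-independent alternative in plain Python, assuming the class string has already been extracted with `xpath("@class")`; `rating_from_class` is a hypothetical helper name, not part of Scrapy:

```python
PREFIX = "count-"


def rating_from_class(class_attr):
    """Return the number from a count-N class token, or None if there is none."""
    for token in class_attr.split():
        # Only accept tokens that start with the prefix and end in digits,
        # so "discount-5" or a bare "count-" are ignored.
        value = token[len(PREFIX):]
        if token.startswith(PREFIX) and value.isdecimal():
            return int(value)
    return None


print(rating_from_class("star-rating count-4 size-medium clearfix"))  # 4
```

Splitting on whitespace keeps the lookup independent of the order of the classes, which the `split(' ')[1][-1]` version is not.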

Jan · Accepted Answer · 2018-02-08T19:00:12.087

7

You could use a combination of both somewhere in your code:

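import re

# Collect the full class attribute of every .star-rating element.
classes = response.css('.star-rating').xpath("@class").extract()
for cls in classes:
    # Pick out the count-N token from the space-separated class list.
    match = re.search(r'\bcount-\d+\b', cls)
    if match:
        print("Class = {}".format(match.group(0)))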

edited Feb 08 '18 at 19:00

answered Feb 08 '18 at 18:44

Jan


Thanks, ended up combining the two answers to `response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]` – Dan Feb 09 '18 at 11:21

score 4 · Answer 2 · answered Feb 09 '18 at 00:17


You can extract the rating directly with the selector's re() method (re_first() returns only the first match):

for rating in response.xpath('//div[contains(@class, "star-rating")]/@class').re(r'count-(\d+)'):
    print(rating)

answered Feb 09 '18 at 00:17

gangabass


Thanks, ended up combining the two answers to `response.css('.star-rating').xpath("@class").re(r'count-(\d)')[0]` – Dan Feb 09 '18 at 11:22
@Dan You'll get an exception on pages without rating (`[0]` will not work for `None`) – gangabass Feb 09 '18 at 12:05
Thanks, but it looks like 1 star is the lowest allowed. – Dan Feb 09 '18 at 16:07
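
On the point about pages without a rating: `.re()` returns a plain list of strings, so indexing an empty result with `[0]` raises an `IndexError`. A minimal sketch of a guard, using hard-coded lists as stand-ins for what `.re(r'count-(\d+)')` would return; `first_or_default` is a hypothetical helper name, not part of Scrapy:

```python
def first_or_default(matches, default=None):
    """Return the first item of a list of matches, or `default` if it is empty."""
    return matches[0] if matches else default


# Stand-ins for the list returned by .re() on two kinds of page (assumed data).
with_rating = ["4"]
without_rating = []

print(first_or_default(with_rating))     # 4
print(first_or_default(without_rating))  # None
```

The `re_first()` method mentioned in the answer covers the same case: it returns `None` instead of raising when nothing matches.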

score -2 · Answer 3 · answered Oct 17 '18 at 22:54

I had a similar question. Using scrapy v1.5.1 I could extract attributes of elements by name. Here is an example used on Lowes; I did the same with the class attribute

    for product in response.css('ul.product-cards-grid li.product-wrapper'):
        # ::attr(name) selects the value of the named attribute.
        # extract() returns a list; extract_first() returns one value or None.
        prod_href = product.css('li::attr(data-producturl)').extract()
        prod_name = product.css('li::attr(data-producttitle)').extract_first()
        prod_img  = product.css('li::attr(data-productimg)').extract_first()
        prod_id   = product.css('li::attr(data-productid)').extract_first()
