I am trying to scrape T-shirt prices from the following link: https://www.adidas.com/us/search?q=tshirt

In that page's HTML, I am looking at this line:

<div class="gl-price-item gl-price-item--sale notranslate">$36</div>

This is what I ran in the Scrapy shell, and what I got:

>>> fetch('https://www.adidas.com/us/search?q=tshirt')
2022-09-25 23:50:11 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.adidas.com/us/search?q=tshirt> (referer: None)
>>> response.css('div.gl-price-item.gl-price-item--sale.notranslate')
[]

I'd expect response.css('div.gl-price-item.gl-price-item--sale.notranslate') to return at least one item, because an element with those classes contains $36, but I am getting an empty list instead. Why is this happening?

What am I doing wrong here?

Andy Ray
Redshoe
  • Please read [how to ask](https://stackoverflow.com/help/how-to-ask) before asking additional questions, and edit this question to make it appropriate for Stackoverflow. – Andy Ray Sep 26 '22 at 03:56
  • https://docs.scrapy.org/en/latest/topics/dynamic-content.html#topics-javascript-rendering – Andy Ray Sep 26 '22 at 04:04

1 Answer

You are getting an empty list because the prices are loaded dynamically from an API by JavaScript. Scrapy does not execute JavaScript, so the element never appears in the HTML it downloads. You can, however, pull all the required data directly from that API with Scrapy.
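
A quick way to confirm this from the Scrapy shell is to test whether the class name occurs anywhere in the downloaded body, for example with `'gl-price-item' in response.text`. The sketch below applies the same check to two hand-written HTML strings. Both strings are illustrative assumptions rather than the real Adidas markup, and `class_in_html` is a hypothetical helper.

```python
def class_in_html(html: str, class_name: str) -> bool:
    """Return True if class_name occurs anywhere in the raw HTML text."""
    return class_name in html


# Hypothetical bodies: what the server sends vs. what the browser shows after JS runs.
raw_html = '<div id="app"></div><script src="/bundle.js"></script>'
rendered_html = '<div class="gl-price-item gl-price-item--sale notranslate">$36</div>'

print(class_in_html(raw_html, "gl-price-item"))       # False
print(class_in_html(rendered_html, "gl-price-item"))  # True
```

If the check is False for the body Scrapy downloaded, no CSS selector can match that element, whatever its syntax.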

Example:

import scrapy


class TestSpider(scrapy.Spider):
    name = 'test'

    def start_requests(self):
        # Send a browser-like User-Agent with the request.
        headers = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'
        }

        # The API endpoint that serves the search results as JSON.
        api_url = 'https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt'

        yield scrapy.Request(
            url=api_url,
            headers=headers,
            callback=self.parse,
            method='GET',
        )

    def parse(self, response):
        # The body is JSON, so decode it instead of using CSS selectors.
        resp = response.json()

        for item in resp['raw']['itemList']['items']:
            yield {
                'price': item['price'],
                'salePrice': item['salePrice'],
            }

Output:

{'price': 35, 'salePrice': 21}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 23}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 23}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 45, 'salePrice': 45}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 40}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 150, 'salePrice': 60}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 36}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 23}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 21}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 32, 'salePrice': 32}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 10}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 55, 'salePrice': 55}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 30, 'salePrice': 18}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 45, 'salePrice': 45}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 21}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 23}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 32, 'salePrice': 32}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 40}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 15}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 30, 'salePrice': 18}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 40}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 40}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 110, 'salePrice': 110}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 35}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 22, 'salePrice': 22}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 40, 'salePrice': 40}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 35, 'salePrice': 32}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 23}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 25, 'salePrice': 25}
2022-09-26 11:35:58 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt>
{'price': 30, 'salePrice': 30}

... and so on
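
In the spider above, `response.json()` deserializes the JSON body into ordinary Python dicts and lists, so the rest of `parse` is plain dictionary access. The sketch below shows the same traversal offline, using `json.loads` on a small hand-made payload. The payload is an assumption shaped after the keys used above (`raw`, `itemList`, `items`, `price`, `salePrice`), `extract_prices` is a hypothetical helper, and the use of `.get` to tolerate missing keys is an addition that the original spider does not have.

```python
import json


def extract_prices(payload: dict) -> list:
    """Collect price and salePrice for every item, tolerating missing keys."""
    items = payload.get("raw", {}).get("itemList", {}).get("items", [])
    return [
        {"price": item.get("price"), "salePrice": item.get("salePrice")}
        for item in items
    ]


# Hand-made sample body (assumption); the real response has many more fields.
sample_body = (
    '{"raw": {"itemList": {"items": ['
    '{"price": 40, "salePrice": 36}, {"price": 35}]}}}'
)

payload = json.loads(sample_body)  # the same kind of object response.json() returns
print(extract_prices(payload))
# [{'price': 40, 'salePrice': 36}, {'price': 35, 'salePrice': None}]
```

With direct indexing such as `item['salePrice']`, an item that lacks the key raises `KeyError`; the `.get` variant yields `None` for it instead.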

Md. Fazlul Hoque
  • So the key point here is `resp=response.json()`? – Redshoe Sep 26 '22 at 21:16
  • and how did you get the link address of `https://www.adidas.com/api/plp/content-engine/search?sitePath=us&query=tshirt` – Redshoe Sep 26 '22 at 21:19
  • That's the API URL. To learn how to find an API URL, see this discussion: https://stackoverflow.com/questions/1820927/request-monitoring-in-chrome/3019085#3019085 – Md. Fazlul Hoque Sep 26 '22 at 21:23
  • Thank you for the answer! I have another question. I am not getting any output like yours with the prices and sale prices. Instead, I am getting `2022-09-26 17:28:17 [scrapy.downloadermiddlewares.robotstxt] DEBUG: Forbidden by robots.txt: `. I tried changing the user agent, but to no avail. Any advice? – Redshoe Sep 26 '22 at 21:36
  • Go to the settings.py file and set `ROBOTSTXT_OBEY = False` instead of `True`. – Md. Fazlul Hoque Sep 26 '22 at 21:38
  • Aha! It works! Thank you so much! So, would similar websites behave the same, as long as I find the API URL and pull the required data from it? – Redshoe Sep 26 '22 at 21:46
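
The setting behind the "Forbidden by robots.txt" message in the comments above is Scrapy's `ROBOTSTXT_OBEY`. A minimal sketch of the two usual places to change it follows. The variable name `SPIDER_CUSTOM_SETTINGS` is hypothetical and only stands in for the dict that would be assigned to a spider's `custom_settings` class attribute.

```python
# settings.py of the Scrapy project.
# The file generated by `scrapy startproject` sets this to True;
# False tells Scrapy not to check robots.txt before each request.
ROBOTSTXT_OBEY = False

# Alternative: limit the override to a single spider by assigning a dict
# like this one to that spider's `custom_settings` class attribute.
SPIDER_CUSTOM_SETTINGS = {"ROBOTSTXT_OBEY": False}
```

The per-spider form leaves the project-wide default untouched for other spiders in the same project.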