I'm trying to scrape shoe prices from the website referenced in the code below. I have no way of knowing whether my syntax is even correct, so I could really use some help.

from scrapy.spider import BaseSpider
from scrapy import Field
from scrapy import Item
from scrapy.selector import HtmlXPathSelector

def Yeezy(Item):
 price = Field()


class YeezySpider(BaseSpider):
  name = "yeezy"
  allowed_domains = ["https://www.grailed.com/"]
  start_url = ['https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2']

def parse(self, response):
    hxs = HtmlXPathSelector(response)
    price = hxs.css('.listing-price .sub-title:nth-child(1) span').extract()
    items = []
    for price in price:
        item = Yeezy()
        item["price"] = price.select(".listing-price .sub-title:nth-child(1) span").extract()
        items.append(item)
    yield item

The code is reporting this to the console:

ScrapyDeprecationWarning: YeezyScrape.spiders.yeezy_spider.YeezySpider inherits from deprecated class scrapy.spider.BaseSpider, please inherit from scrapy.spider.Spider. (warning only on first subclass, there may be others)
  class YeezySpider(BaseSpider):
2017-08-02 14:45:25-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-02 14:45:25-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-02 14:45:25-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'BOT_NAME': 'YeezyScrape'}
2017-08-02 14:45:25-0700 [scrapy] INFO: Enabled extensions: LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-02 14:45:26-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider opened
2017-08-02 14:45:26-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-02 14:45:26-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-02 14:45:26-0700 [yeezy] INFO: Closing spider (finished)
2017-08-02 14:45:26-0700 [yeezy] INFO: Dumping Scrapy stats:
{'finish_reason': 'finished',
 'finish_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 127000),
 'log_count/DEBUG': 2,
 'log_count/INFO': 7,
 'start_time': datetime.datetime(2017, 8, 2, 21, 45, 26, 125000)}
2017-08-02 14:45:26-0700 [yeezy] INFO: Spider closed (finished)

Process finished with exit code 0

At first I thought the problem was with the CSS selectors I entered, but now I'm not so sure. This is my first time trying a project like this, so I could really use some insight. Thank you in advance.

EDIT: I tried simulating the XHR request in my code by following another example. This is what I have:

import scrapy
from scrapy.http import FormRequest
from scrapy.selector import HtmlXPathSelector
#from YeezyScrape import YeezyscrapeItem


class YeezySpider(scrapy.Spider):
    name = "yeezy"
    allowed_domains = ["www.grailed.com"]
    start_url = ["https://www.grailed.com/feed/0Qu8Gh1qHQ?page=2"]

    def parse(self, response):
        for i in range(0,2):
            yield FormRequest(url = 'https://mnrwefss2q-
dsn.algolia.net/1/indexes/Listing_production/query?x-algolia-
agent=Algolia%20for%20vanilla%20JavaScript%203.21.1&x-algolia-application-
id=MNRWEFSS2Q&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad', 
method="post", formdata={"params":"query:boost
filters:(strata:'basic' OR strata:'grailed' OR strata:'hype') AND 
(category_path:'footwear.slip_ons' OR category_path:'footwear.sandals' OR 
category_path:'footwear.lowtop_sneakers' OR category_path:'footwear.leather' 
OR category_path:'footwear.hitop_sneakers' OR 
category_path:'footwear.formal_shoes' OR category_path:'footwear.boots') AND 
(marketplace:grailed)
hitsPerPage:40
facets ["strata","size","category","category_size",
 "category_path","category_path_size",
"category_path_root_size","price_i","designers.id",
"location","marketplace"] 
page:2"}, callback=self.data_parse())

def data_parse(self, response):
    hxs = HtmlXPathSelector(response)
    prices = hxs.xpath("//p").extract()
    for prices in prices:
        price = prices.select("a/text()").extract()
        print price

I had to reformat things a little to account for the indentation differences between Python and Stack Overflow.

These are the logs reported in the terminal. Again, thanks for the help:

C:\Python27\python.exe C:/Python27/Lib/site-packages/scrapy/cmdline.py crawl yeezy -o price.json
2017-08-04 13:23:27-0700 [scrapy] INFO: Scrapy 0.25.1 started (bot: YeezyScrape)
2017-08-04 13:23:27-0700 [scrapy] INFO: Optional features available: ssl, http11
2017-08-04 13:23:27-0700 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'YeezyScrape.spiders', 'FEED_FORMAT': 'json', 'SPIDER_MODULES': ['YeezyScrape.spiders'], 'FEED_URI': 'price.json', 'BOT_NAME': 'YeezyScrape'}
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled extensions: FeedExporter, LogStats, TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2017-08-04 13:23:27-0700 [scrapy] INFO: Enabled item pipelines: 
2017-08-04 13:23:27-0700 [yeezy] INFO: Spider opened
2017-08-04 13:23:28-0700 [yeezy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-08-04 13:23:28-0700 [scrapy] DEBUG: Web service listening on 127.0.0.1:6080
2017-08-04 13:23:28-0700 [yeezy] INFO: Closing spider (finished)
2017-08-04 13:23:28-0700 [yeezy] INFO: Dumping Scrapy stats:
    {'finish_reason': 'finished',
     'finish_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 3000),
     'log_count/DEBUG': 2,
     'log_count/INFO': 7,
     'start_time': datetime.datetime(2017, 8, 4, 20, 23, 28, 1000)}
2017-08-04 13:23:28-0700 [yeezy] INFO: Spider closed (finished)

Process finished with exit code 0

1 Answer

It seems the products are retrieved via AJAX (see related: Can scrapy be used to scrape dynamic content from websites that are using AJAX?).
If you open your browser's web inspector, select the Network tab, and look for XHR requests while the page loads, you can see this:

[Screenshot: Firebug network tab]

It seems a POST request is being made with the categories, filters, etc., and a JSON document of products is returned. You can reverse-engineer that request and replicate it in Scrapy.
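As a rough sketch of what replicating it could look like, here are the two pieces a spider callback would need: building the POST body and reading prices out of the JSON that comes back. This is not tested against the live site. The URL is copied from the question's EDIT, the `{"params": ...}` body shape is an assumption based on the form data shown there, the filter is shortened for illustration, and the `hits` and `price` keys are guesses at the response layout that you should check against the actual XHR response in the network tab. `build_body` and `extract_prices` are hypothetical helper names.

```python
import json
from urllib.parse import urlencode

# Endpoint copied from the question's EDIT. The key is whatever the site
# exposed at the time and may no longer be valid.
ALGOLIA_URL = (
    "https://mnrwefss2q-dsn.algolia.net/1/indexes/Listing_production/query"
    "?x-algolia-application-id=MNRWEFSS2Q"
    "&x-algolia-api-key=a3a4de2e05d9e9b463911705fb6323ad"
)


def build_body(page, hits_per_page=40):
    """Return the JSON request body for one page of results.

    Assumption: the endpoint accepts {"params": "<url-encoded query string>"},
    as the form data in the question suggests.
    """
    params = urlencode({
        "query": "boost",
        "filters": "(marketplace:grailed)",  # shortened filter, illustration only
        "hitsPerPage": hits_per_page,
        "page": page,
    })
    return json.dumps({"params": params})


def extract_prices(body_text, price_key="price"):
    """Collect prices from a JSON response.

    Assumption: listings live under "hits" and each one carries price_key.
    Hits without that key are skipped.
    """
    data = json.loads(body_text)
    return [hit[price_key] for hit in data.get("hits", []) if price_key in hit]
```

In a spider, each body would go into something like `scrapy.Request(ALGOLIA_URL, method="POST", body=build_body(page), callback=self.parse_listings)`, where `parse_listings` is a hypothetical callback that passes the response body to `extract_prices`. Pass the callback as a reference (`self.parse_listings`), not as a call (`self.parse_listings()`). Also note that both spiders in the question spell the attribute `start_url`; Scrapy reads `start_urls`, which would likely explain why the logs show no pages crawled.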

Granitosaurus