Scrapy Spider Doesn't Return Any Information

Question

I'm a student, and for a project I'm collecting information on brands. I found a website called Kit that I want to scrape for brand names. It has almost 500 pages, and I wrote a Scrapy spider in Python 3 that goes through each page and is meant to copy the list into a dictionary, but I can't figure out the XPath or CSS selector that actually gets the list data. Here's my items.py:

import scrapy

class KitcreatorwebscraperItem(scrapy.Item):
    creator = scrapy.Field()

and here's my spider:

import scrapy

class KitCreatorSpider(scrapy.Spider):
    name = "kitCreators"
    pageNumber = 1

    start_urls = [
        'https://kit.com/brands?page=1',
    ]

    while pageNumber <= 478:
        newUrl = "https://kit.com/brands?page=" + str(pageNumber)
        start_urls.append(newUrl)
        pageNumber += 1

    def parse(self, response):
        for li in response.xpath('//div[@class="section group"][0]'):

It runs successfully, but I have been unable to write an xpath that gets the data I need. What path is necessary, and how do I implement that in the code?

Andersson · Accepted Answer · 2017-07-07T21:09:49.113

You can try the XPath below to extract the brand names:

//a[@class="brandsView-list-item-link ng-binding"]/text()
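As a rough illustration of what this expression selects, here is a minimal sketch that uses only the standard library's `xml.etree.ElementTree`, which supports a limited XPath subset. The markup is hypothetical (the real page's structure may differ), and `extract_brand_names` is a made-up helper name. Inside a Scrapy callback the closest equivalent would be `response.xpath(...).extract()` with the full expression above.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-rendered markup; the real page may look different.
RENDERED_HTML = """
<div class="section group">
  <ul>
    <li><a class="brandsView-list-item-link ng-binding" href="/brands/a">Brand A</a></li>
    <li><a class="brandsView-list-item-link ng-binding" href="/brands/b">Brand B</a></li>
    <li><a class="other-link" href="/about">About</a></li>
  </ul>
</div>
"""

# ElementTree paths are relative, hence the leading "." and no trailing /text().
BRAND_LINK_XPATH = './/a[@class="brandsView-list-item-link ng-binding"]'


def extract_brand_names(markup):
    """Return the stripped text of every anchor whose class matches exactly."""
    root = ET.fromstring(markup.strip())
    return [a.text.strip() for a in root.findall(BRAND_LINK_XPATH) if a.text]


brand_names = extract_brand_names(RENDERED_HTML)
print(brand_names)
```

Note that `@class="..."` is an exact string comparison, so an anchor only matches if its class attribute contains both class names in this order.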

P.S. I would suggest not building the list of URLs up front, as it seems to be a redundant piece of code. Instead, you might use a for loop like:

for page_number in range(1, 479):  # pages 1..478, matching the question's loop
    url = "https://kit.com/brands?page=%s" % page_number
    # ... handle the current page source ...
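One way to package that loop, as a sketch only: a small generator that yields one URL per page. The names `brand_page_urls`, `BASE_URL` and `LAST_PAGE` are made up for illustration, and the page count of 478 is taken from the loop bound in the question rather than verified against the site.

```python
from urllib.parse import urlencode

BASE_URL = "https://kit.com/brands"
LAST_PAGE = 478  # assumption: taken from the loop bound in the question


def brand_page_urls(first_page=1, last_page=LAST_PAGE):
    """Yield one listing URL per page number, both ends inclusive."""
    for page_number in range(first_page, last_page + 1):
        yield "%s?%s" % (BASE_URL, urlencode({"page": page_number}))


urls = list(brand_page_urls())
print(len(urls), urls[0], urls[-1])
```

In a Scrapy spider, a generator like this could feed `start_requests`, yielding `scrapy.Request(url, callback=self.parse)` for each URL instead of filling `start_urls` in the class body.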

Update

Because the page content is rendered dynamically, you can try Selenium + PhantomJS to get the required data:

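from selenium import webdriver

# Selenium 3 API: both webdriver.PhantomJS and find_elements_by_xpath
# were removed in Selenium 4.
driver = webdriver.PhantomJS()
brands_list = []

for page in range(1, 480):
    # Load the page in a headless browser so its JavaScript renders the list.
    driver.get("https://kit.com/brands?page=%s" % page)
    links = driver.find_elements_by_xpath('//a[@class="brandsView-list-item-link ng-binding"]')
    brands_list.extend(link.text for link in links)

driver.quit()  # shut down the browser process
print(brands_list)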

edited Jul 07 '17 at 21:09

answered Jul 07 '17 at 18:49

Andersson

Hi Andersson, when I plug this in in this form: **def parse(self, response): yield { "company":response.xpath('//a[@class="brandsView-list-item-link ng-binding"]/text()') }** I am still met with an empty list as the output. Could you provide more info on implementation or placement of this path in the code? – Thomas Hughes Jul 07 '17 at 18:54
This is because page content is dynamic and you cannot get it simply with `scrapy`. Check [this](https://stackoverflow.com/questions/30345623/scraping-dynamic-content-using-python-scrapy) – Andersson Jul 07 '17 at 19:02
I have been unsuccessful in implementing your attached method thus far, but I'll keep trying. If you have any recommendations or suggestions, they are absolutely welcome. Thanks! – Thomas Hughes Jul 07 '17 at 20:36
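To make the point about dynamic content concrete, here is a small standard-library sketch contrasting what a plain HTTP client might receive with what a browser shows after JavaScript has run. Both snippets are hypothetical stand-ins, not the site's actual markup. The relevant fact is that AngularJS adds the `ng-binding` class and fills in the text at runtime, so an exact class match against the unrendered source finds nothing.

```python
import xml.etree.ElementTree as ET

BRAND_LINK_XPATH = './/a[@class="brandsView-list-item-link ng-binding"]'

# Hypothetical markup as served, before any JavaScript has run.
RAW_TEMPLATE = """<ul>
  <li ng-repeat="brand in brands">
    <a class="brandsView-list-item-link">{{ brand.name }}</a>
  </li>
</ul>"""

# Hypothetical markup after a browser has rendered the template.
RENDERED_DOM = """<ul>
  <li><a class="brandsView-list-item-link ng-binding">Brand A</a></li>
  <li><a class="brandsView-list-item-link ng-binding">Brand B</a></li>
</ul>"""


def match_brand_links(markup):
    """Return the text of every anchor matched by the brand-link expression."""
    root = ET.fromstring(markup)
    return [a.text for a in root.findall(BRAND_LINK_XPATH)]


raw_matches = match_brand_links(RAW_TEMPLATE)
rendered_matches = match_brand_links(RENDERED_DOM)
print(raw_matches)
print(rendered_matches)
```

Under these assumptions the raw source yields an empty list, which matches the empty output reported above, while the rendered DOM yields the brand names. That is why a tool that executes JavaScript is needed before the XPath can match.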
