I'm a student, and for a project I'm collecting information on brands. I found this website called Kit (Kit Page) that I want to scrape for brands. It has almost 500 pages, and I wrote a Scrapy spider in Python 3 that goes through each of the pages and copies the list to a dictionary, but I can't figure out the XPath or CSS selector to actually get the list info. Here's my items.py:

import scrapy

class KitcreatorwebscraperItem(scrapy.Item):
    creator = scrapy.Field()

and here's my spider:

import scrapy

class KitCreatorSpider(scrapy.Spider):
    name = "kitCreators"
    pageNumber = 1

    start_urls = [
        'https://kit.com/brands?page=1',
    ]

    while pageNumber <= 478:
        newUrl = "https://kit.com/brands?page=" + str(pageNumber)
        start_urls.append(newUrl)
        pageNumber += 1

    def parse(self, response):
        for li in response.xpath('//div[@class="section group"][0]'):
            pass  # stuck here: this XPath does not return the brand list

It runs successfully, but I have been unable to write an XPath that gets the data I need. What path is necessary, and how do I implement it in the code?

Thomas Hughes

1 Answer

You can try the XPath below to extract brand names:

//a[@class="brandsView-list-item-link ng-binding"]/text()
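
As a rough illustration of what this expression selects, here is a minimal sketch run against a small, hypothetical snippet of already-rendered markup. The sample HTML and the brand names "Acme" and "Globex" are made up, and the standard library's `xml.etree.ElementTree` stands in for Scrapy's selectors. ElementTree does not support `/text()`, so the sketch reads the `.text` attribute of each matched link instead.

```python
import xml.etree.ElementTree as ET

# Hypothetical, already-rendered markup; the real page's structure may differ.
SAMPLE_HTML = """
<div class="section group">
  <ul>
    <li><a class="brandsView-list-item-link ng-binding" href="/brands/acme">Acme</a></li>
    <li><a class="brandsView-list-item-link ng-binding" href="/brands/globex">Globex</a></li>
    <li><a class="other-link" href="/about">About</a></li>
  </ul>
</div>
""".strip()

# Same attribute test as the XPath above, written relative to the root element.
BRAND_LINK_XPATH = './/a[@class="brandsView-list-item-link ng-binding"]'


def extract_brands(markup):
    """Return the text of every link whose class attribute matches exactly."""
    root = ET.fromstring(markup)
    return [a.text.strip() for a in root.findall(BRAND_LINK_XPATH) if a.text]


print(extract_brands(SAMPLE_HTML))
```

Note that `@class="..."` is an exact string comparison, so a link only matches when its class attribute is precisely that string, with both class names in that order.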

P.S. I would suggest not building the list of URLs up front; it seems to be a redundant piece of code. Instead, you could use a for loop like:

for page_number in range(1, 479):  # pages 1 through 478
    url = "https://kit.com/brands?page=%s" % page_number
    # ...handle current page source...
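
One way to package that loop, sketched under the assumption that the last page is 478 (the bound used in the question's spider), is a small generator that produces each page URL on demand. The names `BASE_URL`, `LAST_PAGE`, and `page_urls` are hypothetical helpers, not part of Scrapy.

```python
BASE_URL = "https://kit.com/brands?page=%s"
LAST_PAGE = 478  # assumption: taken from the loop bound in the question


def page_urls(last_page=LAST_PAGE):
    """Yield the URL of every listing page, from page 1 through last_page."""
    for page_number in range(1, last_page + 1):
        yield BASE_URL % page_number


urls = list(page_urls())
print(len(urls), urls[0], urls[-1])
```

In a spider, a `start_requests` method could yield `scrapy.Request(url, callback=self.parse)` for each URL from such a generator instead of filling `start_urls` in the class body.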

Update

You can try Selenium + PhantomJS to get the required data from the dynamic content:

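from selenium import webdriver

# PhantomJS executes the page's JavaScript, so the brand links exist in the DOM
driver = webdriver.PhantomJS()
brands_list = []

for page in range(1, 480):
    driver.get("https://kit.com/brands?page=%s" % page)
    # Collect the visible text of every brand link on the current page
    links = driver.find_elements_by_xpath('//a[@class="brandsView-list-item-link ng-binding"]')
    brands_list.extend(link.text for link in links)

driver.quit()  # shut down the headless browser process
print(brands_list)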
Andersson
  • Hi Andersson, when I plug this in like so: `def parse(self, response): yield { "company": response.xpath('//a[@class="brandsView-list-item-link ng-binding"]/text()') }` I still get an empty list as the output. Could you provide more info on the implementation or placement of this path in the code? – Thomas Hughes Jul 07 '17 at 18:54
  • This is because the page content is dynamic, so you cannot get it with `scrapy` alone. Check [this](https://stackoverflow.com/questions/30345623/scraping-dynamic-content-using-python-scrapy) – Andersson Jul 07 '17 at 19:02
  • I have been unsuccessful in implementing your attached method thus far, but I'll keep trying. If you have any recommendations or suggestions, they are absolutely welcome. Thanks! – Thomas Hughes Jul 07 '17 at 20:36