I'm still working on my first few Scrapy projects, and I came across a website with an infinite scroll where the requested URL is the same every time. I have tried to look for solutions, but all the material I've read involves URLs with some distinguishing feature (page number, text, etc.). How do I go about extracting all the names that come up on https://www.baincapital.com/people? I have figured out my selectors, but the spider is only returning the initially visible info. Any help will be appreciated. My code so far:

import scrapy
from scrapy_splash import SplashRequest


class BainPeople(scrapy.Spider):
    name = 'BainPeop'
    start_urls = [
        'https://www.baincapital.com/people',
    ]

    def start_requests(self):
        for url in self.start_urls:
            yield SplashRequest(url=url, callback=self.parse, args={"wait": 3})

    def parse(self, response):
        name = response.css('h4 span::text').extract()
        links = response.css('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a::attr(href)').extract()

        yield {'name': name, 'links': links}

(Screenshot: the same URL is requested on every scroll)

Updated code:

import scrapy
from selenium import webdriver

class BainpeopleSpider(scrapy.Spider):
    name = 'bainpeople'
    allowed_domains = ['baincapital.com']  # domain only, not the full URL
    start_urls = ['http://www.baincapital.com/people/']

    def parse(self, response):
        driver = webdriver.Chrome(executable_path='C:/Users/uchit.madhok/Downloads/chromedriver_win32/chromedriver')
        driver.get('http://www.baincapital.com/people/')

        # find_elements_* returns a list, so extract .text / the href from each element
        name = [el.text for el in driver.find_elements_by_css_selector('h4 span')]
        links = [el.get_attribute('href') for el in driver.find_elements_by_css_selector('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a')]

        yield {
            'name': name,
            'links': links
        }

        driver.quit()

Final Code:

import scrapy
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
import time

class BainpeopleSpider(scrapy.Spider):
    name = 'bainpeople'
    allowed_domains = ['baincapital.com']
    start_urls = ['http://www.baincapital.com/people/']

    def parse(self, response):
        browser = webdriver.Chrome(executable_path='C:/Users/uchit.madhok/Downloads/chromedriver_win32/chromedriver')
        browser.get('http://www.baincapital.com/people/')

        # Press END repeatedly so the infinite scroll keeps loading new profiles;
        # the pause gives the lazy-loaded content time to arrive
        elm = browser.find_element_by_tag_name('html')
        for _ in range(30):
            elm.send_keys(Keys.END)
            time.sleep(8)
            elm.send_keys(Keys.HOME)

        links = [a.get_attribute('href') for a in browser.find_elements_by_css_selector('div.col-xs-6.col-sm-4.col-md-6.col-lg-3.grid.staff a')]
        browser.quit()

        for link in links:
            yield response.follow(link, callback=self.parse_detail)

    def parse_detail(self, response):
        name = response.css('h1.pageTitle::text').extract()
        title = response.css('div.__location::text')[0].extract()
        team = response.css('div.__location::text')[1].extract()
        location = response.css('div.__location::text')[2].extract()
        about = response.css('div.field-item.even p::text').extract()
        sector = response.css('ul.focus_link a::text').extract()

        yield {
            'name': name,
            'title': title,
            'team': team,
            'location': location,
            'about': about,
            'sector': sector
        }
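
For completeness, a spider like this runs the same way as any other Scrapy spider; the output filename below is just an example:

scrapy crawl bainpeople -o people.json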

1 Answer

The thing you're trying to do is probably impossible using Scrapy alone. Accessing dynamic data is a well-known problem, but fortunately there are solutions. One of them is Selenium. Here you can see how it can be used to access dynamic data from a page and how to integrate it with Scrapy: selenium with scrapy for dynamic page
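
For reference, here is a minimal sketch of that pattern: Selenium renders the JavaScript-driven page, then the rendered HTML is handed back to Scrapy's selectors. The spider name here is made up, and chromedriver is assumed to be on PATH:

import scrapy
from selenium import webdriver

class PeopleSpider(scrapy.Spider):
    name = 'people'
    start_urls = ['https://www.baincapital.com/people']

    def parse(self, response):
        # Let Selenium render the page, then parse the rendered DOM with Scrapy
        driver = webdriver.Chrome()  # assumes chromedriver is on PATH
        driver.get(response.url)
        rendered = scrapy.Selector(text=driver.page_source)
        driver.quit()

        yield {'name': rendered.css('h4 span::text').extract()}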

Łukasz Karczewski
  • Okay, so I tried doing this with Selenium, but I've yet to get the results I'm looking for, and I'm pretty sure there is something wrong with the selector I'm using. When I use `driver.find_element_by_css_selector('h4 span').text` I get only one name from the list, and when I change it to `driver.find_elements_by_css_selector('h4 span').text` I get an error saying the list object has no attribute 'text'. When running the code, the Chrome browser pops up and closes, so I know it's running fine. @Łukasz Karczewski – Uchit Madhok Feb 07 '20 at 13:03
  • Yeah, you should do something like this: ```map(lambda x: x.text, driver.find_elements_by_css_selector('h4 span'))``` – Łukasz Karczewski Feb 07 '20 at 13:05
  • I tried running that and it doesn't give an error, though it didn't scrape anything; the output is just `'name':` – Uchit Madhok Feb 07 '20 at 13:17
  • If you want to make it human readable you need to do this: ```list(map(lambda x: x.text, driver.find_elements_by_css_selector('h4 span')))``` – Łukasz Karczewski Feb 07 '20 at 13:20
  • Yes, that worked. However, I'm still getting just the first sixteen names that are initially visible and not the remaining ones that show up as you keep scrolling to the bottom of the page. I want to get all 100 names that appear once you've scrolled to the end. – Uchit Madhok Feb 07 '20 at 13:27
  • This question can be helpful when it comes to scrolling down with Selenium: https://stackoverflow.com/questions/20986631/how-can-i-scroll-a-web-page-using-selenium-webdriver-in-python . Just scroll down and then run your scraping function (see the sketch after these comments). – Łukasz Karczewski Feb 07 '20 at 13:35
  • Great. Thanks @Łukasz Karczewski. Figured out the rest of it as well. – Uchit Madhok Feb 11 '20 at 10:22
  • Great, wish you luck in your future pursuits – Łukasz Karczewski Feb 11 '20 at 11:42
  • Thanks. One more clarification. What if I want to open a page on selenium webdriver that loads content through ajax requests? Will scrapy be able to see the content? If not, how do I combine scrapy splash here? – Uchit Madhok Feb 12 '20 at 10:17
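
For reference, a sketch of the scrollHeight-based loop described in the question linked above; `scroll_to_bottom`, the pause length, and the round limit are illustrative choices, not part of the original answer:

import time
from selenium import webdriver

def scroll_to_bottom(driver, pause=3, max_rounds=30):
    # Scroll until the page height stops growing (or max_rounds is hit)
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the lazy-loaded content time to arrive
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # no new content loaded, so we've reached the bottom
        last_height = new_height

driver = webdriver.Chrome()  # assumes chromedriver is on PATH
driver.get('https://www.baincapital.com/people')
scroll_to_bottom(driver)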