
I know this isn't an error message, but I don't understand how to scrape the Startup India website. I'm trying to open the profile links listed in the search results, but I can't, because Scrapy can't click on links, and the information I need can only be reached through those links.

import scrapy
from selenium import webdriver

class ProductSpider(scrapy.Spider):
    name = "product_spider"
    allowed_domains = ['www.startupindia.gov.in']  # domain only, not a full URL
    start_urls = ['https://www.startupindia.gov.in/content/sih/en/search.html?industries=sih:industry/advertising&states=sih:location/india/andhra-pradesh&stages=Prototype&roles=Startup&page=0']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome("C:/Users/RAJ/PycharmProjects/WebCrawler/WebCrawler/WebCrawler/spiders/chromedriver.exe")
        self.profile = []

    def parse(self, response):
        self.driver.get(response.url)

        while True:
            try:
                # Locate the first result card and try to click it.
                # Finding the element inside the try block keeps a
                # NoSuchElementException from escaping the loop.
                card = self.driver.find_element_by_xpath('//*[@id="persona-results"]/div[1]/div/a/div[1]')
                card.click()
                # get the data and write it to scrapy items
            except Exception:
                break

        self.driver.close()

By the way, my end goal is to get all the profile details, but I don't know how. (PS: this is my first time doing web scraping.)
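From reading around, I wonder if the fix is to stop clicking and instead collect each card's href with Selenium, then hand those URLs to Scrapy. Something like this untested sketch inside parse; the XPath is loosened from the one in my spider above, and parse_profile is a callback I'd still have to write:

        # Untested idea: grab the profile URLs from the result cards instead of clicking them.
        # //*[@id="persona-results"]//a is a guess based on the XPath in my spider above.
        for card in self.driver.find_elements_by_xpath('//*[@id="persona-results"]//a'):
            link = card.get_attribute('href')
            if link:
                yield scrapy.Request(link, callback=self.parse_profile)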

1 Answer


This sounds similar to the tutorial in the Scrapy documentation, linked below. In general, you don't click a link with Scrapy: right-click the element you would normally click on, inspect it to get its CSS/XPath selector, extract the link's href, and let Scrapy follow that URL (see the "follow links to author pages" comment in the code below).

https://docs.scrapy.org/en/latest/intro/tutorial.html

Alternatively, feel free to share what you have tried so far. Hope this helps!

import scrapy


class AuthorSpider(scrapy.Spider):
    name = 'author'

    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        # follow links to author pages
        for href in response.css('.author + a::attr(href)'):
            yield response.follow(href, self.parse_author)

        # follow pagination links
        for href in response.css('li.next a::attr(href)'):
            yield response.follow(href, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default='').strip()

        yield {
            'name': extract_with_css('h3.author-title::text'),
            'birthdate': extract_with_css('.author-born-date::text'),
            'bio': extract_with_css('.author-description::text'),
        }
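To adapt this to your case: the Startup India search page builds its results with JavaScript, so a plain Scrapy response may not contain the links at all. A common pattern is to keep Selenium only to render the page, collect the profile URLs, and hand them to Scrapy to crawl. Below is a minimal, untested sketch along those lines; the #persona-results XPath comes from your question, ProfileSpider and parse_profile are just illustrative names, and the crude time.sleep would be better replaced with an explicit WebDriverWait. If the profile pages themselves are JavaScript-rendered too, you would reuse the driver for those as well.

import time

import scrapy
from selenium import webdriver


class ProfileSpider(scrapy.Spider):
    name = 'profile'
    start_urls = ['https://www.startupindia.gov.in/content/sih/en/search.html?industries=sih:industry/advertising&states=sih:location/india/andhra-pradesh&stages=Prototype&roles=Startup&page=0']

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.driver = webdriver.Chrome()  # point this at your chromedriver if it isn't on PATH

    def parse(self, response):
        # Render the JavaScript-built result list with Selenium.
        self.driver.get(response.url)
        time.sleep(5)  # crude wait for the results to load; WebDriverWait is cleaner

        # Collect the profile URLs instead of clicking the cards,
        # then let Scrapy fetch each profile page normally.
        for card in self.driver.find_elements_by_xpath('//*[@id="persona-results"]//a'):
            href = card.get_attribute('href')
            if href:
                yield scrapy.Request(href, callback=self.parse_profile)

        self.driver.quit()

    def parse_profile(self, response):
        # Inspect a profile page and pull out the fields you need here.
        yield {'url': response.url}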