It's my first time using Scrapy after watching a couple of tutorials. I'm trying to scrape this URL:

https://www.hackster.io/arduino/members

I want to get the links to every user profile. I ran this in the Scrapy shell:

print(response.css("#main > div > div > div > div:nth-child(2) > div.hckui__layout__container > div.hckui__layout__wrapper1170 hckui__layout__fullScreenHeight > div > div.common-overlay__parent__1A_nT > div.grid__gridBasic__fjt5B grid__grid__1QeD6 grid__guttersH__2MYvz grid__guttersV__3M28R > div:nth-child(1) > div.undefined hckui__layout__flexCenterItems > div.user_card__content__1YVc5 > a.hckui__typography__bodyM hckui__typography__link hckui__typography__bold::attr(href)").extract())

but I get only [] as output.

I want to get the link shown in the attached photo. Can anyone please have a look and tell me if there is something wrong with my command?

[screenshot: url to be scraped]

When I used Google Chrome's Inspect option and copied the selector directly, I got the same output with

#main > div > div > div > div:nth-child(2) > div > div > div > div.common-overlay__parent__1A_nT > div > div:nth-child(1) > div > div > a
or even using

#main > div > div > div > div:nth-child(2) > div > div > div > div.common-overlay__parent__1A_nT > div 
YakovL

2 Answers

That's because the HTML you see in the Chrome console is built client-side by JavaScript. Scrapy does not interpret JavaScript by default; it reads the page source as it is sent by the server. See my answer here for solutions to your problem.
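
As a quick sanity check (a sketch; "user_card" is just the class prefix taken from the selector in your question), you can confirm from the Scrapy shell that the raw server response does not contain the member cards:

    scrapy shell https://www.hackster.io/arduino/members
    >>> "user_card" in response.text
    # likely False, because the cards are injected client-side
    >>> response.css("div[class*='user_card'] a::attr(href)").getall()
    # likely [], for the same reason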

Corentin Limier

To check what response the Scrapy crawler is getting:

  1. Open a terminal
  2. Run the command scrapy shell https://www.hackster.io/arduino/members
  3. Run the command view(response)

The response as seen by the crawler will be shown in your default web browser.
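
For example, the shell session looks roughly like this (a sketch; the actual log output will differ):

    $ scrapy shell https://www.hackster.io/arduino/members
    ...
    >>> view(response)
    # writes the response body to a temporary file and opens it in your default browser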

From this response you can check whether your crawler is getting the content you want to scrape!

As I can see from the response, you are not getting Arduino_Genuino, so this is definitely a case of client-side JavaScript rendering.

[Screenshot of the webpage as visible to the crawler]

To scrape data from such pages, you need to use a JavaScript rendering engine such as scrapy-splash, which uses a Splash server running at localhost:8050.
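
Splash itself is usually started as a Docker container, for example:

    docker run -p 8050:8050 scrapinghub/splash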

You have to pass the URL you want to scrape to the Splash rendering engine; after a short wait, once the JavaScript has fully loaded in Splash at localhost:8050, you scrape the data from the rendered response.
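
For illustration, here is a minimal spider sketch using scrapy-splash (the spider name, the 3-second wait, and the user_card selector are assumptions; adjust them to the rendered markup):

    import scrapy
    from scrapy_splash import SplashRequest

    class ArduinoMembersSpider(scrapy.Spider):
        name = "arduino_members"

        # These settings normally live in settings.py; shown here to keep the sketch self-contained.
        custom_settings = {
            "SPLASH_URL": "http://localhost:8050",
            "DOWNLOADER_MIDDLEWARES": {
                "scrapy_splash.SplashCookiesMiddleware": 723,
                "scrapy_splash.SplashMiddleware": 725,
                "scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware": 810,
            },
            "SPIDER_MIDDLEWARES": {
                "scrapy_splash.SplashDeduplicateArgsMiddleware": 100,
            },
            "DUPEFILTER_CLASS": "scrapy_splash.SplashAwareDupeFilter",
        }

        def start_requests(self):
            # Ask Splash to render the page and wait for the client-side JavaScript to finish.
            yield SplashRequest(
                "https://www.hackster.io/arduino/members",
                callback=self.parse,
                args={"wait": 3},
            )

        def parse(self, response):
            # Assumed selector: links inside the rendered user cards.
            for href in response.css("div[class*='user_card'] a::attr(href)").getall():
                yield {"profile_url": response.urljoin(href)}

Run it with scrapy crawl arduino_members -o members.json once Splash is up.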

Refer to the Splash docs: https://splash.readthedocs.io/en/stable/api.html

nilansh bansal