
I want to use Scrapy to log in to LinkedIn, but I get this output:

2018-10-23 13:36:38 [scrapy.utils.log] INFO: Scrapy 1.5.1 started (bot: scrapybot)
2018-10-23 13:36:38 [scrapy.utils.log] INFO: Versions: lxml 4.2.5.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.0, w3lib 1.19.0, Twisted 18.9.0, Python 3.7.1 (v3.7.1:260ec2c36a, Oct
20 2018, 14:05:16) [MSC v.1915 32 bit (Intel)], pyOpenSSL 18.0.0 (OpenSSL 1.1.0i  14 Aug 2018), cryptography 2.3.1, Platform Windows-10-10.0.17134-SP0
2018-10-23 13:36:38 [scrapy.crawler] INFO: Overridden settings: {'SPIDER_LOADER_WARN_ONLY': True}
2018-10-23 13:36:38 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.logstats.LogStats']
2018-10-23 13:36:39 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2018-10-23 13:36:39 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2018-10-23 13:36:39 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2018-10-23 13:36:39 [scrapy.core.engine] INFO: Spider opened
2018-10-23 13:36:39 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-10-23 13:36:39 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-23 13:36:39 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.linkedin.com/uas/login> (referer: None)
2018-10-23 13:36:40 [scrapy.core.engine] DEBUG: Crawled (200) <POST https://www.linkedin.com/uas/login> (referer: https://www.linkedin.com/uas/login)
2018-10-23 13:36:40 [scrapy.core.scraper] DEBUG: Scraped from <200 https://www.linkedin.com/uas/login>
{'user_name': None}
2018-10-23 13:36:40 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-23 13:36:40 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 1109,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 1,
 'downloader/request_method_count/POST': 1,
 'downloader/response_bytes': 41004,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 23, 11, 36, 40, 798034),
 'item_scraped_count': 1,
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'request_depth_max': 1,
 'response_received_count': 2,
 'scheduler/dequeued': 2,
 'scheduler/dequeued/memory': 2,
 'scheduler/enqueued': 2,
 'scheduler/enqueued/memory': 2,
 'start_time': datetime.datetime(2018, 10, 23, 11, 36, 39, 140792)}
2018-10-23 13:36:40 [scrapy.core.engine] INFO: Spider closed (finished)

I run the spider this way: scrapy runspider linkedin-scrapy.py

In the code below, session_key is the LinkedIn username and session_password is the LinkedIn password.

Here is my code:

import scrapy


class LoginSpider(scrapy.Spider):
    name = 'linkedinName'
    login_url = 'https://www.linkedin.com/uas/login?'
    start_urls = [login_url]

    def parse(self, response):
        token = response.css('input[name="loginCsrfParam"]::attr(value)').extract_first()
        data = {
            'csrf_token': token,
            'session_key': '***',
            'session_password': '***',
        }

        yield scrapy.FormRequest(url=self.login_url, formdata=data, callback=self.parse_quotes)

    def parse_quotes(self, response):
        yield {
            'user_name': response.css('div.left-rail-container').extract_first()
        }

This results in

INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)

How can I fix this?

1 Answer


LinkedIn's User Agreement forbids scraping:

LinkedIn is committed to keeping its members' data safe and its website free from fraud and abuse. In order to protect our members' data and our website, we don't permit the use of any third party software, including "crawlers", bots, browser plug-ins, or browser extensions (also called "add-ons"), that scrapes, modifies the appearance of, or automates activity on LinkedIn's website. Such tools violate the User Agreement

It's likely that technical barriers are preventing your scraper from working. Aside from that, LinkedIn makes heavy use of JavaScript, which Scrapy does not execute by default, so consuming the site would require significant extra work.
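Note also what your log actually shows: the item {'user_name': None} was scraped from https://www.linkedin.com/uas/login, i.e. the POST returned the login page again, so the login was rejected. One plausible cause (I haven't verified LinkedIn's form, and it changes over time) is that the CSRF token is extracted from an input named loginCsrfParam but posted back under a different key, csrf_token; login forms generally expect the token under the same field name the input carries. Here is a minimal, offline sketch of that idea using made-up HTML and only the standard library:

```python
# Extract hidden-input values from a login form and post the CSRF token
# back under the SAME field name the form uses.
# SAMPLE is made-up HTML, not LinkedIn's actual markup.
from html.parser import HTMLParser

SAMPLE = '<form><input name="loginCsrfParam" value="ajax:123"></form>'

class InputValues(HTMLParser):
    """Collect a name -> value mapping for every <input> tag."""
    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag == "input":
            d = dict(attrs)
            if "name" in d:
                self.fields[d["name"]] = d.get("value", "")

parser = InputValues()
parser.feed(SAMPLE)
token = parser.fields["loginCsrfParam"]

# The token goes back under the key the form actually uses,
# not an invented one like "csrf_token":
formdata = {
    "loginCsrfParam": token,
    "session_key": "user@example.com",   # placeholder credentials
    "session_password": "secret",
}
print(formdata["loginCsrfParam"])  # ajax:123
```

In Scrapy itself, FormRequest.from_response(response, formdata=..., ...) handles this for you by pre-filling all hidden inputs from the form, so you only supply the username and password fields. A useful debugging check in the callback is whether response.url is still the login URL; if it is, the login failed and there is nothing to extract.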

Even if you technically can scrape LinkedIn, you'd be breaking the User Agreement. I strongly advise against doing this.
