Questions tagged [scrapinghub]

Scrapinghub is a web scraping development and services company that supplies cloud-based web crawling platforms.

179 questions
11
votes
1 answer

Unable to run/deploy custom script with shub-image

I have a problem running/deploying a custom script with shub-image. setup.py: from setuptools import setup, find_packages setup( name = 'EU-Crawler', version = '1.0', packages = find_packages(), scripts = [ …
parik
  • 2,313
  • 12
  • 39
  • 67
8
votes
4 answers

scrapy passing custom_settings to spider from script using CrawlerProcess.crawl()

I am trying to programmatically call a spider through a script. I am unable to override the settings through the constructor using CrawlerProcess. Let me illustrate this with the default spider for scraping quotes from the official scrapy site (last…
hAcKnRoCk
  • 1,118
  • 3
  • 16
  • 30
6
votes
1 answer

Scrapy hidden memory leak

Background - TLDR: I have a memory leak in my project. I spent a few days looking through the memory leak docs for scrapy and can't find the problem. I'm developing a medium-sized scrapy project, ~40k requests per day. I am hosting this using…
Hector Haffenden
  • 1,360
  • 10
  • 25
6
votes
0 answers

Pygsheets unable to find the server at www.googleapis.com

I'm trying to use pygsheets in a script on ScrapingHub. The pygsheets part of the script begins with: google_client = pygsheets.authorize(service_file=CREDENTIALS_FILENAME, no_cache=True) spreadsheet = google_client.open_by_key(SHEET_ID) Where…
5
votes
1 answer

Scrapy does not fetch markup on response.css

I've built a simple scrapy spider running on scrapinghub: class ExtractionSpider(scrapy.Spider): name = "extraction" allowed_domains = ['domain'] start_urls = ['http://somedomainstart'] user_agent = "Mozilla/5.0 (Windows NT 10.0;…
qubits
  • 1,227
  • 3
  • 20
  • 50
4
votes
1 answer

scrapy how to load urls from file at scrapinghub

I know how to load data into a Scrapy spider from an external source when working locally. But I struggle to find any info on how to deploy this file to scrapinghub and what path to use there. Now I use this approach from SH documentation - enter link…
Billy Jhon
  • 1,035
  • 15
  • 30
4
votes
2 answers

Download project's source-code from Scrapinghub

I have a project deployed on Scrapinghub, and I do not have any copy of that code at all. How can I download the whole project's code to my localhost from Scrapinghub?
Umair Ayub
  • 19,358
  • 14
  • 72
  • 146
3
votes
1 answer

Splash - Scrapy - HAR data

In general I understand how to work with Scrapy and XPath to parse the HTML. However, I don't know how to grab the HAR data. import scrapy from scrapy_splash import SplashRequest class QuotesSpider(scrapy.Spider): name = 'quotes' …
Zach
  • 421
  • 1
  • 5
  • 11
3
votes
1 answer

Why is scrapy with crawlera running so slow?

I am using scrapy 1.7.3 with crawlera (C100 plan from scrapinghub) and python 3.6. When running the spider with crawlera enabled I get about 20 - 40 items per minute. Without crawlera I get 750 - 1000 (but I get banned quickly of course). Have I…
Wramana
  • 183
  • 1
  • 4
  • 16
3
votes
1 answer

Use Splash from Scrapinghub locally

I got a subscription for splash on scrapinghub and I want to use this from a script that is running on my local machine. The instructions I have found so far are: Edit the settings file: #I got this one from my scraping hub account SPLASH_URL =…
3
votes
1 answer

ScrapingHub Environment Variables Not Loaded

I'm deploying a bunch of spiders on ScrapingHub. The spider itself is working. I would like to change the feed output depending on whether the spider is running locally or on ScrapingHub (if it is running locally then output to a temp folder, if it…
Ze Xuan
  • 56
  • 6
3
votes
1 answer

scrapinghub starting job too slow

I am new to scraping and I am running different jobs on scrapinghub. I run them via their API. The problem is that starting the spider and initializing it takes too much time, about 30 seconds. When I run it locally, it takes up to 5 seconds to finish…
Mara M
  • 153
  • 1
  • 1
  • 10
3
votes
2 answers

Scrapy and Splash times out for a specific site

I have an issue with Scrapy, Crawlera and Splash when trying to fetch responses from this site. I tried the following without luck: pure Scrapy shell - times out; Scrapy + Crawlera - times out; Scrapinghub Splash instance (small) - times…
3
votes
2 answers

How to install xvfb on Scrapinghub for using Selenium?

I use Python-Selenium in my spider (Scrapy). To use Selenium I need to install xvfb on Scrapinghub. When I use apt-get to install xvfb I get this error message: E: Could not open lock file /var/lib/dpkg/lock - open (13: Permission denied) …
parik
  • 2,313
  • 12
  • 39
  • 67
2
votes
1 answer

Scrapy crawlera authentication issue

I've been trying to use scrapy-crawlera as a proxy for scraping some data with scrapy. I've added these lines in settings.py: DOWNLOADER_MIDDLEWARES = { 'scrapy_crawlera.CrawleraMiddleware': 610, } CRAWLERA_ENABLED = True CRAWLERA_APIKEY =…
memeister
  • 53
  • 5