Questions tagged [pyspider]

A powerful Python-based spider (web crawler) system

Used for

  • Write scripts in Python with a powerful API
  • Powerful web UI with script editor, task monitor, project manager and result viewer
  • MySQL, MongoDB, SQLite as database backend
  • JavaScript pages supported
  • Task priority, retry, periodic crawling, and recrawl by age or by marks in the index page (such as update time)
  • Distributed architecture
38 questions
26
votes
2 answers

Can Scrapy be replaced by pyspider?

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed and popular. pyspider's home…
alecxe
  • 462,703
  • 120
  • 1,088
  • 1,195
6
votes
1 answer

Python ValueError: Invalid header name b':authority

I can see that the ':' is the problem, but I can't find a way to solve it. ValueError: Invalid header name b':authority' Here is the traceback: File "tmall.py", line 23, in get_url response = sessions.get(url=url,headers =headers) File…
user8608946
  • 79
  • 1
  • 1
  • 6
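
One possible way to sidestep the error in the question above, sketched under the assumption that the headers were copied from browser developer tools and therefore include HTTP/2 pseudo-headers (names beginning with ':', such as ':authority'), is to drop those entries before passing the dict to an HTTP/1.1 client. The helper name `strip_pseudo_headers` and the sample values are hypothetical.

```python
def strip_pseudo_headers(headers):
    """Return a copy of `headers` without names that start with ':'.

    Assumption: such names are HTTP/2 pseudo-headers that an HTTP/1.1
    client rejects as invalid header names.
    """
    return {name: value for name, value in headers.items()
            if not name.startswith(":")}


# Hypothetical sample, shaped like headers copied from a browser.
copied = {
    ":authority": "example.com",
    ":method": "GET",
    "user-agent": "Mozilla/5.0",
    "accept": "text/html",
}
clean = strip_pseudo_headers(copied)
print(sorted(clean))
```

The cleaned dict could then be passed as `headers=clean` in place of the original one.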
2
votes
1 answer

What is the best way to keep column names after applying OneHotEncoder in Python?

What is the best way to keep column names after one-hot encoding in Python? All my features are categorical, so I did the following. After importing the dataset, it looks like this: PlaceID Date ... BlockedRet OverallSeverity 0 23620 …
2
votes
0 answers

Python error 104: Connection reset by peer

I can't figure out why I keep getting this error or how to fix it. I've run a bunch of different URLs and this error doesn't happen every time. Is it something I can fix or something in my code I can fix or is this something out of my power to…
Emily
  • 21
  • 1
  • 1
  • 4
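
Since the error in the question above is intermittent, one common mitigation is to retry the request after a short pause. The following is a minimal sketch, not a definitive fix: it assumes the reset surfaces as Python's built-in `ConnectionResetError`, and the names `retry_on_reset`, `fetch`, and `flaky` are hypothetical. HTTP libraries often wrap the error in their own exception type, in which case the `except` clause would need adjusting.

```python
import time


def retry_on_reset(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() and retry with exponential backoff on ConnectionResetError."""
    for attempt in range(attempts):
        try:
            return fetch()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a request that is reset twice, then succeeds.
calls = {"count": 0}


def flaky():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "ok"


delays = []
result = retry_on_reset(flaky, sleep=delays.append)  # record delays instead of sleeping
print(result, delays)
```

Passing `sleep=delays.append` only keeps the example fast; in real use the default `time.sleep` would apply the pauses.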
2
votes
1 answer

Fail to scrape images with pyspider and phantomjs

I want to scrape all the images of the items (iPhone) on this web page. First I extract all the image links, then request each src one by one and download the images to the folder '/phone/'. Here is my code: from…
u3728666
  • 99
  • 2
  • 9
1
vote
2 answers

KeyError: 'Spider not found:

I am following the YouTube video https://youtu.be/s4jtkzHhLzY and have reached 13:45, when the creator runs his spider. I have followed the tutorial precisely, yet my code refuses to run. This is my actual code. I imported scrapy as well. Can…
lnky
  • 13
  • 4
1
vote
1 answer

libcurl link-time ssl backends (schannel) do not include

ImportError: pycurl: libcurl link-time ssl backends (schannel) do not include compile-time ssl backend (openssl) I use Windows 10, Python 3.9, and pycurl-7.44.1-cp39-cp39-win_amd64.whl, but I can't import pycurl. Please help.
freecoder
  • 11
  • 1
1
vote
0 answers

Spider - Python (3.7) - How to group long code

I have a DataFrame that has 500 rows. Is there a way to collapse it so it appears as a small (+) that I can expand again (-) if needed? It would make my code more readable. I did not find anything within the toolbar, so not sure if this…
1
vote
0 answers

Scrapy-redis slows down item pipelines

I'm just using the dupfilter and scheduler reimplemented by scrapy-redis to support recovering from interruptions (Redis only contains two keys - dmoz:dupfilter & dmoz:requests), and using one item pipeline to store items in remote MongoDB. However,…
Rossil
  • 67
  • 5
1
vote
1 answer

ReactorNotRestartable error when running two spiders sequentially using CrawlerProcess

I'm trying to run two spiders sequentially, here is the structure of my module class tmallSpider(scrapy.Spider): name = 'tspider' ... class jdSpider(scrapy.Spider): name = 'jspider' ... process =…
Tianhe Xie
  • 261
  • 1
  • 10
1
vote
1 answer

Warning when building a web crawler in Python using BeautifulSoup

I am trying to build a simple web crawler that gives the URLs of every legion product displayed on amazon.in when the search key is 'legion'. I am using the following code: import requests from bs4 import BeautifulSoup def…
Dastan
  • 7
  • 1
1
vote
1 answer

Problem with Spyder 4.01 and Python 2.7

Spyder (4.1) no longer works with Python 2.7 in an Anaconda environment. When I launch Spyder, it does not open and I do not get any message. If I launch Spyder with Python 3.8, it works. conda environment : What can I do?
PPP
  • 11
  • 3
1
vote
3 answers

How to find the sitemap in each domain and subdomain using Python

I want to know how to find the sitemap in each domain and subdomain using Python. Some examples: abcd.com/sitemap.xml abcd.com/sitemap.html sub.abcd.com/sitemap.xml etc. What are the most probable sitemap names, locations and…
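
As a rough illustration of the question above, one approach is to read the `Sitemap:` lines that a site may list in its robots.txt and, failing that, to probe a few conventional paths. The sketch below uses only string handling and performs no network requests; the function names, the `https` scheme, and the default file names are assumptions made for the example.

```python
def sitemaps_from_robots(robots_txt):
    """Extract URLs from 'Sitemap:' lines of a robots.txt body.

    The field name is matched case-insensitively.
    """
    urls = []
    for line in robots_txt.splitlines():
        field, sep, value = line.partition(":")
        if sep and field.strip().lower() == "sitemap":
            urls.append(value.strip())
    return urls


def candidate_sitemap_urls(host, names=("sitemap.xml", "sitemap.html")):
    """Build guessed sitemap URLs for a host (scheme and names are assumptions)."""
    return [f"https://{host}/{name}" for name in names]


# Hypothetical robots.txt body for illustration.
robots = (
    "User-agent: *\n"
    "Disallow: /private\n"
    "Sitemap: https://abcd.com/sitemap.xml\n"
    "sitemap: https://sub.abcd.com/sitemap.xml\n"
)
found = sitemaps_from_robots(robots)
guesses = candidate_sitemap_urls("abcd.com")
print(found, guesses)
```

Each subdomain serves its own robots.txt, so the same two steps would be repeated per host.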
1
vote
1 answer

pyspider phantom is not enabled; 501 Server Error

I used pyspider to crawl a website. When using PhantomJS, the following error occurred: I searched for solutions in https://github.com/binux/pyspider/issues/215, where the author seemed to have solved it, so I tried that, but it still didn't work. How to…
Jin.GH
  • 11
  • 2
1
vote
1 answer

Getting ImportError when starting pyspider in Terminal

When I start pyspider by running pyspider all in the terminal, it raises an ImportError: ImportError: cannot import name 'Curlasync_HTTPClient' from…
HneryInSH
  • 63
  • 1
  • 6