Questions tagged [pyspider]

A powerful Python-based spider (web crawler) system

Features

Write scripts in Python with a powerful API
Powerful WebUI with script editor, task monitor, project manager and result viewer
MySQL, MongoDB or SQLite as database backend
JavaScript pages supported
Task priority, retry, periodic crawling, and recrawl by age or by marks on the index page (such as update time)
Distributed architecture

38 questions

votes

2 answers

Can Scrapy be replaced by pyspider?

I've been using the Scrapy web-scraping framework pretty extensively, but recently I've discovered that there is another framework/system called pyspider, which, according to its GitHub page, is fresh, actively developed and popular. pyspider's home…

asked Dec 02 '14 at 06:33

alecxe


votes

1 answer

Python ValueError: Invalid header name b':authority

I can see that the ':' is causing the error, but I can't find a way to solve it. ValueError: Invalid header name b':authority' This is the traceback: File "tmall.py", line 23, in get_url response = sessions.get(url=url,headers =headers) File…

python python-3.x header python-requests pyspider

asked Sep 15 '17 at 03:27

user8608946
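
Header names that begin with a colon, such as `:authority`, are HTTP/2 pseudo-header fields. They typically end up in a `headers` dict when it is copied from a browser's network inspector, and `requests` rejects them as invalid header names. The sketch below shows one possible workaround, assuming the dict was copied that way; the helper name `strip_pseudo_headers` and the sample values are hypothetical and not taken from the question.

```python
def strip_pseudo_headers(headers):
    """Return a copy of headers without names that start with ':'.

    Names such as ':authority' or ':method' are HTTP/2 pseudo-headers,
    not regular header fields, so they are dropped here.
    """
    return {
        name: value
        for name, value in headers.items()
        if not name.startswith(":")
    }


# Hypothetical headers as they might be copied from browser dev tools.
copied = {
    ":authority": "example.com",
    ":method": "GET",
    "user-agent": "Mozilla/5.0",
    "accept": "*/*",
}

clean = strip_pseudo_headers(copied)
print(clean)
```

The cleaned dict could then be passed as `headers=clean` to the request call; the host that `:authority` carried is already implied by the URL.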

votes

1 answer

What is the best way to keep column names after applying OneHotEncoder in Python?

What is the best way to keep column names after applying a one-hot encoder in Python? All my features are categorical, so I did the following: after importing the dataset, it looks like this: PlaceID Date ... BlockedRet OverallSeverity 0 23620 …

python machine-learning one-hot-encoding pyspider

asked Nov 12 '19 at 09:56

user1941183

votes

0 answers

python error 104 Connection reset by peer

I can't figure out why I keep getting this error or how to fix it. I've run a bunch of different URLs and this error doesn't happen every time. Is it something I can fix in my code, or is this something out of my power to…

python python-2.7 web-scraping web-crawler pyspider

asked Jun 05 '17 at 19:23

Emily
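
Because the reset reportedly happens only some of the time, one common mitigation for an intermittent "Connection reset by peer" is to retry the request with a growing delay. The following is a minimal sketch under stated assumptions, not a diagnosis of the asker's code: it assumes Python 3 (where the error surfaces as `ConnectionResetError`, whereas the question is tagged python-2.7), and `call_with_retries` and `flaky_fetch` are hypothetical names. HTTP libraries may wrap the error in their own exception type, which would then need to be caught instead.

```python
import time


def call_with_retries(func, attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call func(), retrying on ConnectionResetError with exponential backoff.

    The last failure is re-raised once all attempts are used up.
    """
    for attempt in range(attempts):
        try:
            return func()
        except ConnectionResetError:
            if attempt == attempts - 1:
                raise
            # Wait base_delay, then 2x, 4x, ... before the next attempt.
            sleep(base_delay * 2 ** attempt)


# Hypothetical stand-in for a network call that fails twice, then succeeds.
calls = {"count": 0}


def flaky_fetch():
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionResetError(104, "Connection reset by peer")
    return "ok"


result = call_with_retries(flaky_fetch, sleep=lambda seconds: None)
print(result, calls["count"])
```

The `sleep` parameter is injected only so the example runs without waiting; in real use the default `time.sleep` would apply.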

votes

1 answer

Fail to scrape images with pyspider and phantomjs

I want to scrape all the images of the items (iphone) on this web page. First I extract all the image links, and then I send a request to each src one by one and download the images to the folder '/phone/'. Here is my code: from…

phantomjs web-crawler pyquery pyspider

asked Jun 02 '16 at 11:19

u3728666

vote

2 answers

KeyError: 'Spider not found:

I am following the YouTube video https://youtu.be/s4jtkzHhLzY and have reached 13:45, when the creator runs his spider. I have followed the tutorial precisely, yet my code refuses to run. This is my actual code. I imported scrapy as well. Can…

python scrapy web-crawler pyspider

asked Dec 29 '21 at 04:03

lnky

vote

1 answer

libcurl link-time ssl backends (schannel) do not include

ImportError: pycurl: libcurl link-time ssl backends (schannel) do not include compile-time ssl backend (openssl) I am using Windows 10 + Python 3.9 + pycurl-7.44.1-cp39-cp39-win_amd64.whl and I can't import it. Please help me.

python python-3.x xml pycurl pyspider

asked Aug 31 '21 at 15:31

freecoder

vote

0 answers

Spider - Python (3.7) - How to group long code

I have a DataFrame that has 500 rows. Is there a way to group it so it appears as a little (+), and I can open it again (-) if needed? It would make my code more readable. I did not find anything in the toolbar, so not sure if this…

python pyspider

asked Apr 22 '21 at 09:19

NeverTooOldToLearn

vote

0 answers

Scrapy-redis slows down item pipelines

I'm just using the dupfilter and scheduler reimplemented by scrapy-redis to support recovering from interruptions (Redis only contains two keys - dmoz:dupfilter & dmoz:requests), and using one item pipeline to store items in remote MongoDB. However,…

python redis scrapy web-crawler pyspider

asked Jul 22 '20 at 04:26

Rossil

vote

1 answer

ReactorNotRestartable error when running two spiders sequentially using CrawlerProcess

I'm trying to run two spiders sequentially, here is the structure of my module class tmallSpider(scrapy.Spider): name = 'tspider' ... class jdSpider(scrapy.Spider): name = 'jspider' ... process =…

python scrapy web-crawler pyspider

asked Jul 09 '20 at 08:46

Tianhe Xie

vote

1 answer

warning in building webcrawler in python using beautifulsoup

I am trying to build a simple web crawler that gives the URLs of every legion product displayed on amazon.in if the key searched is 'legion'. I am using the following code: import requests from bs4 import BeautifulSoup def…

python beautifulsoup web-crawler pyspider

asked May 08 '20 at 16:28

Dastan

vote

1 answer

Problem with Spyder 4.01 and Python 2.7

Spyder (4.1) no longer works with Python 2.7 in an Anaconda environment. When I launch Spyder, it does not open and I do not get any message. If I launch Spyder with Python 3.8, it works. conda environment : What can I do?

python-2.7 pyspider

asked Jan 30 '20 at 14:55

PPP

vote

3 answers

How to find sitemap in each domain and sub domain using python

I want to know how to find the sitemap in each domain and subdomain using Python. Some examples: abcd.com/sitemap.xml abcd.com/sitemap.html abcd.com/sitemap.html sub.abcd.com/sitemap.xml etc. What are the most probable sitemap names, locations and…

python beautifulsoup scrapy sitemap pyspider

asked Oct 26 '19 at 21:58

William Johnson
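
One way to approach the sitemap question is to combine two sources: conventional paths such as those listed in the excerpt, and `Sitemap:` lines in a site's `robots.txt`. The sketch below builds candidate URLs and parses `robots.txt` text without making any network requests. The function names, the default `https` scheme, and the `/sitemap_index.xml` path are assumptions added for illustration, not part of the question.

```python
from urllib.parse import urlunsplit

# Paths from the question's examples, plus one extra assumed convention.
COMMON_SITEMAP_PATHS = ("/sitemap.xml", "/sitemap.html", "/sitemap_index.xml")


def candidate_sitemap_urls(hosts, scheme="https", paths=COMMON_SITEMAP_PATHS):
    """Build candidate sitemap URLs for each host (domain or subdomain)."""
    return [
        urlunsplit((scheme, host, path, "", ""))
        for host in hosts
        for path in paths
    ]


def sitemaps_from_robots(robots_text):
    """Extract the URLs declared on 'Sitemap:' lines of a robots.txt body."""
    found = []
    for line in robots_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip().lower() == "sitemap":
            found.append(value.strip())
    return found


candidates = candidate_sitemap_urls(["abcd.com", "sub.abcd.com"])

# Hypothetical robots.txt content for illustration.
robots = "User-agent: *\nDisallow: /private\nSitemap: https://abcd.com/sitemap.xml\n"
declared = sitemaps_from_robots(robots)
print(candidates[0], declared)
```

Each candidate would still need to be fetched and checked for a successful response; the sketch only produces the list of URLs to try.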

vote

1 answer

pyspider phantom is not enabled; 501 Server Error

I used pyspider to crawl a website. When using PhantomJS, an error occurred as follows: I searched for solutions in https://github.com/binux/pyspider/issues/215, where the author seemed to have solved it, so I tried that, but it still didn't work. How to…

phantomjs pyspider

asked Oct 23 '18 at 16:45

Jin.GH

vote

1 answer

Getting ImportError when starting pyspider in Terminal

When I start pyspider by running pyspider all in the terminal, it raises an ImportError: ImportError: cannot import name 'Curlasync_HTTPClient' from…

python python-3.x tornado macos-high-sierra pyspider

asked May 30 '18 at 23:38

HneryInSH
