Hi everyone, I have this problem: I want to scrape multiple URLs in the same domain and save the results to a JSON file, but the output contains only the result from the last URL, repeated n times.
Maybe an example will help me explain.
UPDATE WITH THE REAL CODE. This is my code:
import scrapy

class Test(scrapy.Spider):
    name = "testscraper"
    allowed_domains = ['ebird.org']
    start_urls = [
        'https://ebird.org/species/ostric2',
        'https://ebird.org/species/ostric3',
        'https://ebird.org/species/y00934',
        'https://ebird.org/species/grerhe1',
        'https://ebird.org/species/lesrhe2'
    ]

    def start_requests(self):
        for url in self.start_urls:
            print('---------------------------')
            print(url)
            print('---------------------------')
            yield scrapy.Request(url=url, callback=self.parse, dont_filter=True)

    def parse(self, response):
        print('***************************')
        print(response.url)
        print('***************************')
        image = response.css('img').xpath('@src').get()
        code = response.url[-7::]
        common_name = response.xpath('//span[@class="Heading-main Media--hero-title"]//text()').get()
        scientific_name = response.xpath('//span[@class="Heading-sub Heading-sub--sci Heading-sub--custom u-text-4-loose"]//text()').get()
        description = response.xpath('//p[@class="u-stack-sm"]/text()').get()
        if description:
            description = description.split('\n', 1)[0]
        yield {
            'code': code,
            'scientific_name': scientific_name,
            'common_name': common_name,
            'description': description,
            'image': image,
            'url': response.url
        }
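A side note on the `code = response.url[-7::]` line: it assumes every species code is exactly 7 characters long, but `y00934` has only 6, so the slice would also pick up the slash before it. A minimal sketch of an alternative, assuming the code is always the last path segment of the URL (the helper name `species_code` is made up for this example and is not part of Scrapy):

```python
from urllib.parse import urlparse


def species_code(url: str) -> str:
    """Return the last path segment of a URL, ignoring a trailing slash."""
    path = urlparse(url).path
    return path.rstrip('/').rsplit('/', 1)[-1]


# Compare the fixed-width slice with the path-based helper.
for url in ('https://ebird.org/species/ostric2',
            'https://ebird.org/species/y00934'):
    print(repr(url[-7:]), repr(species_code(url)))
```

Inside `parse` this would be called as `species_code(response.url)`. It does not address the duplicated results, only the field value.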
So when I run:
scrapy crawl testscraper -O testscraper.json
the file testscraper.json contains:
[
{"code": "lesrhe2", "scientific_name": "Rhea pennata", "common_name": "Lesser Rhea", "description": "This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.", "image": "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800", "audio": "assetId\":524686", "url": "https://ebird.org/species/lesrhe2"},
{"code": "lesrhe2", "scientific_name": "Rhea pennata", "common_name": "Lesser Rhea", "description": "This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.", "image": "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800", "audio": "assetId\":524686", "url": "https://ebird.org/species/lesrhe2"},
{"code": "lesrhe2", "scientific_name": "Rhea pennata", "common_name": "Lesser Rhea", "description": "This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.", "image": "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800", "audio": "assetId\":524686", "url": "https://ebird.org/species/lesrhe2"},
{"code": "lesrhe2", "scientific_name": "Rhea pennata", "common_name": "Lesser Rhea", "description": "This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.", "image": "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800", "audio": "assetId\":524686", "url": "https://ebird.org/species/lesrhe2"},
{"code": "lesrhe2", "scientific_name": "Rhea pennata", "common_name": "Lesser Rhea", "description": "This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.", "image": "https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800", "audio": "assetId\":524686", "url": "https://ebird.org/species/lesrhe2"}
]
That is the last dict repeated 5 times, once for every URL.
I searched for help and somebody recommended this setting:
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
But it still doesn't work.
I usually don't ask for help, but I really don't understand what is happening. Maybe it is something silly, but I just don't see it. Please give me a hint if you know what is happening.
Current settings:
DOWNLOAD_DELAY = 30
DUPEFILTER_CLASS = 'scrapy.dupefilters.BaseDupeFilter'
CONCURRENT_REQUESTS_PER_DOMAIN=1
And the logs:
2021-08-17 00:32:16 [scrapy.utils.log] INFO: Scrapy 2.5.0 started (bot: ebird)
2021-08-17 00:32:16 [scrapy.utils.log] INFO: Versions: lxml 4.6.3.0, libxml2 2.9.10, cssselect 1.1.0, parsel 1.6.0, w3lib 1.22.0, Twisted 21.7.0, Python 3.8.5 (default, Sep 4 2020, 07:30:14) - [GCC 7.3.0], pyOpenSSL 20.0.1 (OpenSSL 1.1.1k 25 Mar 2021), cryptography 3.4.7, Platform Linux-5.4.0-80-generic-x86_64-with-glibc2.10
2021-08-17 00:32:16 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2021-08-17 00:32:16 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'ebird',
'CONCURRENT_REQUESTS_PER_DOMAIN': 1,
'DOWNLOAD_DELAY': 10,
'DUPEFILTER_CLASS': 'scrapy.dupefilters.BaseDupeFilter',
'NEWSPIDER_MODULE': 'ebird.spiders',
'ROBOTSTXT_OBEY': True,
'SPIDER_MODULES': ['ebird.spiders']}
2021-08-17 00:32:16 [scrapy.extensions.telnet] INFO: Telnet Password: 00b83b5e4e0bedd7
2021-08-17 00:32:16 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
'scrapy.extensions.telnet.TelnetConsole',
'scrapy.extensions.memusage.MemoryUsage',
'scrapy.extensions.feedexport.FeedExporter',
'scrapy.extensions.logstats.LogStats']
2021-08-17 00:32:16 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
'scrapy.downloadermiddlewares.retry.RetryMiddleware',
'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
'scrapy.downloadermiddlewares.stats.DownloaderStats']
2021-08-17 00:32:16 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
'scrapy.spidermiddlewares.referer.RefererMiddleware',
'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2021-08-17 00:32:16 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2021-08-17 00:32:16 [scrapy.core.engine] INFO: Spider opened
2021-08-17 00:32:16 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-17 00:32:16 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
---------------------------
https://ebird.org/species/ostric2
---------------------------
---------------------------
https://ebird.org/species/ostric3
---------------------------
---------------------------
https://ebird.org/species/y00934
---------------------------
---------------------------
https://ebird.org/species/grerhe1
---------------------------
---------------------------
https://ebird.org/species/lesrhe2
---------------------------
2021-08-17 00:32:17 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/robots.txt> (referer: None)
2021-08-17 00:32:29 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en> from <GET https://ebird.org/species/ostric2>
2021-08-17 00:32:29 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://secure.birds.cornell.edu/robots.txt> (referer: None)
2021-08-17 00:32:40 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en> from <GET https://ebird.org/species/ostric3>
2021-08-17 00:32:57 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en> from <GET https://ebird.org/species/y00934>
2021-08-17 00:33:01 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en> from <GET https://ebird.org/species/grerhe1>
2021-08-17 00:33:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en> from <GET https://ebird.org/species/lesrhe2>
2021-08-17 00:33:16 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 2 pages/min), scraped 0 items (at 0 items/min)
2021-08-17 00:33:28 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/login/cas?portal=ebird> from <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en>
2021-08-17 00:33:41 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/login/cas?portal=ebird> from <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en>
2021-08-17 00:33:54 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/login/cas?portal=ebird> from <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en>
2021-08-17 00:34:05 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/login/cas?portal=ebird> from <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en>
2021-08-17 00:34:16 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-17 00:34:19 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/login/cas?portal=ebird> from <GET https://secure.birds.cornell.edu/cassso/login?service=https%3A%2F%2Febird.org%2Flogin%2Fcas%3Fportal%3Debird&gateway=true&locale=en>
2021-08-17 00:34:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:34:42 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:34:53 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:35:14 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:35:15 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:35:16 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-17 00:35:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
2021-08-17 00:35:34 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
2021-08-17 00:35:51 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
2021-08-17 00:36:03 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
2021-08-17 00:36:16 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
2021-08-17 00:36:16 [scrapy.extensions.logstats] INFO: Crawled 2 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2021-08-17 00:36:28 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/species/lesrhe2> (referer: None)
***************************
https://ebird.org/species/lesrhe2
***************************
2021-08-17 00:36:28 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ebird.org/species/lesrhe2>
{'code': 'lesrhe2', 'scientific_name': 'Rhea pennata', 'common_name': 'Lesser Rhea', 'description': 'This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.', 'image': 'https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800', 'audio': 'assetId":524686', 'url': 'https://ebird.org/species/lesrhe2'}
2021-08-17 00:36:42 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/species/lesrhe2> (referer: None)
***************************
https://ebird.org/species/lesrhe2
***************************
2021-08-17 00:36:42 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ebird.org/species/lesrhe2>
{'code': 'lesrhe2', 'scientific_name': 'Rhea pennata', 'common_name': 'Lesser Rhea', 'description': 'This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.', 'image': 'https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800', 'audio': 'assetId":524686', 'url': 'https://ebird.org/species/lesrhe2'}
2021-08-17 00:36:54 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/species/lesrhe2> (referer: None)
***************************
https://ebird.org/species/lesrhe2
***************************
2021-08-17 00:36:55 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ebird.org/species/lesrhe2>
{'code': 'lesrhe2', 'scientific_name': 'Rhea pennata', 'common_name': 'Lesser Rhea', 'description': 'This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.', 'image': 'https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800', 'audio': 'assetId":524686', 'url': 'https://ebird.org/species/lesrhe2'}
2021-08-17 00:37:16 [scrapy.extensions.logstats] INFO: Crawled 5 pages (at 3 pages/min), scraped 3 items (at 3 items/min)
2021-08-17 00:37:26 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/species/lesrhe2> (referer: None)
***************************
https://ebird.org/species/lesrhe2
***************************
2021-08-17 00:37:26 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ebird.org/species/lesrhe2>
{'code': 'lesrhe2', 'scientific_name': 'Rhea pennata', 'common_name': 'Lesser Rhea', 'description': 'This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.', 'image': 'https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800', 'audio': 'assetId":524686', 'url': 'https://ebird.org/species/lesrhe2'}
2021-08-17 00:37:32 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://ebird.org/species/lesrhe2> (referer: None)
***************************
https://ebird.org/species/lesrhe2
***************************
2021-08-17 00:37:32 [scrapy.core.scraper] DEBUG: Scraped from <200 https://ebird.org/species/lesrhe2>
{'code': 'lesrhe2', 'scientific_name': 'Rhea pennata', 'common_name': 'Lesser Rhea', 'description': 'This flightless South American relative of the Ostrich stands about 5 feet tall with a body about the size of a sheep; no similar species in its range. Rheas roam widely on open Patagonian steppe and also occur locally in open habitats of the Andes, mainly at very high elevations. Can be confiding where used to people, but in other areas wary, running strongly and quickly. Rheas occur singly or in groups, and males take care of the young. Adults have bold pale spots on the body, first-year birds are plainer overall.', 'image': 'https://cdn.download.ams.birds.cornell.edu/api/v1/asset/115691341/1800', 'audio': 'assetId":524686', 'url': 'https://ebird.org/species/lesrhe2'}
2021-08-17 00:37:32 [scrapy.core.engine] INFO: Closing spider (finished)
2021-08-17 00:37:32 [scrapy.extensions.feedexport] INFO: Stored json feed (5 items) in: birdscraper.json
2021-08-17 00:37:32 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 7672,
'downloader/request_count': 27,
'downloader/request_method_count/GET': 27,
'downloader/response_bytes': 373670,
'downloader/response_count': 27,
'downloader/response_status_count/200': 7,
'downloader/response_status_count/302': 20,
'elapsed_time_seconds': 315.86914,
'feedexport/success_count/FileFeedStorage': 1,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2021, 8, 17, 4, 37, 32, 599873),
'httpcompression/response_bytes': 1367567,
'httpcompression/response_count': 6,
'item_scraped_count': 5,
'log_count/DEBUG': 32,
'log_count/INFO': 16,
'memusage/max': 66494464,
'memusage/startup': 56147968,
'response_received_count': 7,
'robotstxt/request_count': 2,
'robotstxt/response_count': 2,
'robotstxt/response_status_count/200': 2,
'scheduler/dequeued': 25,
'scheduler/dequeued/memory': 25,
'scheduler/enqueued': 25,
'scheduler/enqueued/memory': 25,
'start_time': datetime.datetime(2021, 8, 17, 4, 32, 16, 730733)}
2021-08-17 00:37:32 [scrapy.core.engine] INFO: Spider closed (finished)
I noticed the redirects, so I tried to disable them with "dont_redirect". After that the script prints the correct URL but shows an error, because the spider never reached the page and so couldn't get any field.
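To make the redirect chain in the logs easier to read, the "Redirecting (302)" lines can be tallied by target URL. This is only a rough sketch that assumes the log line format shown above; the function name `redirect_targets` and the two-line sample are made up for the example (the sample lines are copied from the log).

```python
import re
from collections import Counter

# Matches Scrapy's redirect debug lines as they appear in the log above.
REDIRECT_RE = re.compile(
    r"Redirecting \((\d{3})\) to <GET ([^>\s]+)> from <GET ([^>\s]+)>"
)


def redirect_targets(log_text: str) -> Counter:
    """Count how often each URL appears as the target of a redirect."""
    targets = Counter()
    for match in REDIRECT_RE.finditer(log_text):
        _status, target, _source = match.groups()
        targets[target] += 1
    return targets


sample = """\
2021-08-17 00:34:32 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/ebird/species/lesrhe2> from <GET https://ebird.org/login/cas?portal=ebird>
2021-08-17 00:35:21 [scrapy.downloadermiddlewares.redirect] DEBUG: Redirecting (302) to <GET https://ebird.org/species/lesrhe2> from <GET https://ebird.org/ebird/species/lesrhe2>
"""

for url, count in redirect_targets(sample).items():
    print(count, url)
```

Run against the full log, this shows that the login step sends all five requests to the lesrhe2 page, which matches the five identical items in the output.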