
The start_url in this spider seems to be causing a problem, but I'm unsure why. Here is the project breakdown.

import scrapy
from statements.items import StatementsItem


class IncomeannualSpider(scrapy.Spider):
    name = 'incomeannual'
    start_urls = ['https://www.marketwatch.com/investing/stock/A/financials']

    def parse(self, response):
        item = {}

        item['ticker'] = response.xpath("//h1[contains(@id, 'instrumentname')]//text()").extract()
        item['sales2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[0]
        item['sales2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[1]
        item['sales2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[2]
        item['sales2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[3]
        item['sales2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales/Revenue']]/text()").extract()[4]
        item['sales2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[0]
        item['sales2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[1]
        item['sales2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[2]
        item['sales2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[3]
        item['sales2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Sales Growth']]/text()").extract()[4]
        item['cogs2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[0]
        item['cogs2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[1]
        item['cogs2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[2]
        item['cogs2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[3]
        item['cogs2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Cost of Goods Sold (COGS) incl. D&A']]/text()").extract()[4]
        item['cogs2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[0]
        item['cogs2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[1]
        item['cogs2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[2]
        item['cogs2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[3]
        item['cogs2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='COGS Growth']]/text()").extract()[4]
        item['pretaxincome2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[0]
        item['pretaxincome2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[1]
        item['pretaxincome2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[2]
        item['pretaxincome2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[3]
        item['pretaxincome2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income']]/text()").extract()[4]
        item['pretaxincome2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[0]
        item['pretaxincome2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[1]
        item['pretaxincome2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[2]
        item['pretaxincome2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[3]
        item['pretaxincome2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Pretax Income Growth']]/text()").extract()[4]
        item['netincome2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[0]
        item['netincome2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[1]
        item['netincome2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[2]
        item['netincome2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[3]
        item['netincome2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income']]/text()").extract()[4]
        item['netincome2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[0]
        item['netincome2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[1]
        item['netincome2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[2]
        item['netincome2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[3]
        item['netincome2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='Net Income Growth']]/text()").extract()[4]
        item['eps2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[0]
        item['eps2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[1]
        item['eps2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[2]
        item['eps2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[3]
        item['eps2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic)']]/text()").extract()[4]
        item['eps2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[0]
        item['eps2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[1]
        item['eps2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[2]
        item['eps2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[3]
        item['eps2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) Growth']]/text()").extract()[4]
        item['eps2014altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
        item['eps2015altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[1]
        item['eps2016altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[2]
        item['eps2017altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[3]
        item['eps2018altrate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[4]
        item['ebitda2014'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[0]
        item['ebitda2015'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[1]
        item['ebitda2016'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[2]
        item['ebitda2017'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[3]
        item['ebitda2018'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA']]/text()").extract()[4]
        item['ebitda2014rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[0]
        item['ebitda2015rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[1]
        item['ebitda2016rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[2]
        item['ebitda2017rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[3]
        item['ebitda2018rate'] = response.xpath("//td[./preceding-sibling::td[normalize-space()='EBITDA Growth']]/text()").extract()[4]

        yield item

All of the XPaths were tested against the start_url in the Scrapy shell and appeared to work fine.

2019-03-17 10:25:06 [scrapy.utils.log] INFO: Scrapy 1.6.0 started (bot: statements)
2019-03-17 10:25:06 [scrapy.utils.log] INFO: Versions: lxml 4.3.1.0, libxml2 2.9.5, cssselect 1.0.3, parsel 1.5.1, w3lib 1.20.0, Twisted 18.9.0, Python 3.7.2 (tags/v3.7.2:9a3ffc0492, Dec 23 2018, 23:09:28) [MSC v.1916 64 bit (AMD64)], pyOpenSSL 19.0.0 (OpenSSL 1.1.1a  20 Nov 2018), cryptography 2.5, Platform Windows-10-10.0.17763-SP0
2019-03-17 10:25:06 [scrapy.crawler] INFO: Overridden settings: {'BOT_NAME': 'statements', 'FEED_EXPORT_ENCODING': 'utf-8', 'FEED_FORMAT': 'csv', 'FEED_URI': 'sdasda.csv', 'NEWSPIDER_MODULE': 'statements.spiders', 'ROBOTSTXT_OBEY': True, 'SPIDER_MODULES': ['statements.spiders'], 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36'}
2019-03-17 10:25:06 [scrapy.extensions.telnet] INFO: Telnet Password: 3580241d541f00bb
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.feedexport.FeedExporter',
 'scrapy.extensions.logstats.LogStats']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.robotstxt.RobotsTxtMiddleware',
 'scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'statements.middlewares.StatementsDownloaderMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2019-03-17 10:25:06 [scrapy.middleware] INFO: Enabled item pipelines:
['statements.pipelines.StatementsPipeline']
2019-03-17 10:25:06 [scrapy.core.engine] INFO: Spider opened
2019-03-17 10:25:06 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2019-03-17 10:25:06 [incomeannual] INFO: Spider opened: incomeannual
2019-03-17 10:25:06 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6024
2019-03-17 10:25:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marketwatch.com/robots.txt> (referer: None)
2019-03-17 10:25:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://www.marketwatch.com/investing/stock/A/financials> (referer: None)
2019-03-17 10:25:07 [scrapy.core.scraper] ERROR: Spider error processing <GET https://www.marketwatch.com/investing/stock/A/financials> (referer: None)
Traceback (most recent call last):
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\utils\defer.py", line 102, in iter_errback
        yield next(it)
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\offsite.py", line 29, in 
process_spider_output
    for x in result:
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\referer.py", line 339, in <genexpr>
    return (_set_referer(r) for r in result or ())
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\urllength.py", line 37, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "c:\users\jesse\appdata\local\programs\python\python37\lib\site- 
   packages\scrapy\spidermiddlewares\depth.py", line 58, in <genexpr>
    return (r for r in result or () if _filter(r))
  File "C:\Users\Jesse\Files\Financial\statements\statements\spiders\incomeannual.py", 
line 64, in parse
    item['eps2014altrate'] = response.xpath("//td[./preceding- 
sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
IndexError: list index out of range
2019-03-17 10:25:07 [scrapy.core.engine] INFO: Closing spider (finished)
2019-03-17 10:25:07 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 636,
 'downloader/request_count': 2,
 'downloader/request_method_count/GET': 2,
 'downloader/response_bytes': 25693,
 'downloader/response_count': 2,
 'downloader/response_status_count/200': 2,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2019, 3, 17, 14, 25, 7, 786531),
 'log_count/DEBUG': 2,
 'log_count/ERROR': 1,
 'log_count/INFO': 10,
 'response_received_count': 2,
 'robotstxt/request_count': 1,
 'robotstxt/response_count': 1,
 'robotstxt/response_status_count/200': 1,
 'scheduler/dequeued': 1,
 'scheduler/dequeued/memory': 1,
 'scheduler/enqueued': 1,
 'scheduler/enqueued/memory': 1,
 'spider_exceptions/IndexError': 1,
 'start_time': datetime.datetime(2019, 3, 17, 14, 25, 6, 856319)}
2019-03-17 10:25:07 [scrapy.core.engine] INFO: Spider closed (finished)

This site requires the USER_AGENT setting to be set before it allows scraping. I've tried specifying headers in settings.py, but this spider will actually be run with over 5000 start_urls and I'm not sure how to apply that setting across multiple URLs. I've used this setup in several other projects and they work fine.

Any advice will be very much appreciated! Thanks!

jayjey

2 Answers


The error in your log is because that specific XPath returns nothing (tested in scrapy shell):

>>> response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
[]
>>> response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()[0]
Traceback (most recent call last):
  File "<console>", line 1, in <module>
IndexError: list index out of range

You need to check the length of the selector result before indexing into it, because it is not safe to assume that a given index exists. There are various shorthand solutions here: Get value at list/array index or "None" if out of range in Python

Here is one example:

values = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
item['eps2014altrate'] = values[0] if len(values) > 0 else None
item['eps2015altrate'] = values[1] if len(values) > 1 else None
item['eps2016altrate'] = values[2] if len(values) > 2 else None
item['eps2017altrate'] = values[3] if len(values) > 3 else None
item['eps2018altrate'] = values[4] if len(values) > 4 else None

You can make this less verbose by writing a small helper function, as sketched below. Either way, you should use this pattern everywhere, not just for the failing XPath.
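A minimal sketch of such a helper, assuming you keep the dict-based item; the name safe_get is made up for illustration:

def safe_get(values, index, default=None):
    # safe_get is a hypothetical helper name: return values[index],
    # or default when the index is out of range.
    return values[index] if index < len(values) else default

values = response.xpath("//td[./preceding-sibling::td[normalize-space()='EPS (Basic) - Growth']]/text()").extract()
item['eps2014altrate'] = safe_get(values, 0)
item['eps2015altrate'] = safe_get(values, 1)
item['eps2016altrate'] = safe_get(values, 2)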

malberts
  • Try this: scrapy shell -s USER_AGENT="Mozilla/5.0..." "https://www.marketwatch.com/investing/stock/A/financials" – jayjey Mar 17 '19 at 15:33
  • @jayjey I already did that. That specific XPath does not work for that page. Even in the browser's developer tools it returns nothing. Or rather, it just returns nothing for that specific page, even if it returns something for other pages. But either way, it is better to check lengths before you try to get an index value. – malberts Mar 17 '19 at 15:34
  • HEY you were right! I knew this XPath worked on only some of the links, so I thought I would include it anyway and just get back [], but apparently it caused an error instead. I removed it and it works. Thanks for your fast help! – jayjey Mar 17 '19 at 15:47

Try this approach while testing:

try:
    item['ticker'] = response.xpath("//..//text()").extract()
except IndexError:
    item['ticker'] = "-"
try:
    item['sales2014'] = response.xpath("//../text()").extract()[0]
except IndexError:
    item['sales2014'] = "-"
try:
    item['sales2015'] = response.xpath("//../text()").extract()[1]
except IndexError:
    item['sales2015'] = "-"

Later, wrap this pattern in a helper function to keep the code short, for example as sketched below.
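A rough sketch of such a helper, under the same dict-based item assumption; extract_or_dash is a made-up name and the XPaths are just the placeholders from the snippet above:

def extract_or_dash(response, xpath, index):
    # extract_or_dash is a hypothetical helper name: return the index-th
    # text match for xpath, or "-" when the XPath matches nothing or the
    # index is out of range.
    try:
        return response.xpath(xpath).extract()[index]
    except IndexError:
        return "-"

item['sales2014'] = extract_or_dash(response, "//../text()", 0)
item['sales2015'] = extract_or_dash(response, "//../text()", 1)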

Janib Soomro