Take Twitter's own profile as an example: www.twitter.com/twitter
The follower count is displayed as 57.9M, but if you hover over that value you will see the exact number of followers.
This appears in the source as:
<span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span>
When I inspect this span in Chrome, I can select the attribute with:
(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]
I am trying to extract just the "data-count" attribute using the XPath above:
def parseTwitter(self, response):
    company_name=response.meta['company_name']
    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]/text()")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]/text()")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]/text()")
...but I'm not getting anything back:
2018-10-18 10:22:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-18 10:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/ADP> (referer: None)
2018-10-18 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/Workday> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/OracleHCM> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-18 10:22:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'downloader/response_bytes': 199199,
'downloader/response_count': 3,
'downloader/response_status_count/200': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 10, 18, 10, 22, 16, 833691),
'log_count/DEBUG': 4,
'log_count/INFO': 7,
'memusage/max': 52334592,
'memusage/startup': 52334592,
'response_received_count': 3,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 10, 18, 10, 22, 7, 269320)}
SOLUTION: As per pwinz's suggestion below, the problem was that I appended "/text()" to an attribute selection. An attribute node has no text child nodes, so that step matches nothing; selecting the attribute with @data-count already gives you its value. My final, working solution is:
def parseTwitter(self, response):
    company_name=response.meta['company_name']
    print('### ### ### Inside PARSE TWITTER ### ### ###')
    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]")
    yield l.load_item()
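
For anyone who wants to see the difference outside of a spider, here is a minimal standalone sketch. It is not part of my spider: it assumes lxml is installed (Scrapy's selectors are built on top of it, so I would expect the XPath behaviour to match), and it runs against a trimmed-down, hypothetical copy of the profile markup in which the first two counts are made up.

```python
from lxml import html

# Hypothetical, trimmed-down copy of the profile navigation markup.
# Only the third span is taken from the real page; the first two
# counts are invented for illustration.
SAMPLE = """
<html><body>
<ul class="ProfileNav-list">
  <li><a><span class="ProfileNav-value" data-count="1234">1,234</span></a></li>
  <li><a><span class="ProfileNav-value" data-count="56">56</span></a></li>
  <li><a><span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span></a></li>
</ul>
</body></html>
"""

# Same expression as in the spider: the third data-count attribute.
BASE = "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]"

tree = html.fromstring(SAMPLE)

# An attribute node has no child nodes, so the /text() step selects nothing.
with_text = tree.xpath(BASE + "/text()")

# Selecting the attribute node itself yields its string value.
without_text = tree.xpath(BASE)

# The abbreviated label is the text of the span element, not of the attribute.
label = tree.xpath("(//span[@class='ProfileNav-value'])[3]/text()")

print(with_text, without_text, label)
```

The first query comes back as an empty list, which would explain why the ItemLoader had nothing to load, while the second returns the exact count as a string.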