Scrapy: How to extract an attribute value from a <span>

Question

Looking at Twitter: www.twitter.com/twitter

You will see that the number of followers is shown as 57.9M, but if you hover over that value you will see the exact number of followers.

This appears in the source as:

<span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span>

When I inspect this span on Chrome I use:

(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]

I am trying to extract just the attribute "data-count" using the above:

def parseTwitter(self, response):
    company_name=response.meta['company_name']
    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]/text()")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]/text()")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]/text()")

...but I'm not getting anything back:

2018-10-18 10:22:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-18 10:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/ADP> (referer: None)
2018-10-18 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/Workday> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/OracleHCM> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-18 10:22:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 199199,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 18, 10, 22, 16, 833691),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'memusage/max': 52334592,
 'memusage/startup': 52334592,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 10, 18, 10, 22, 7, 269320)}

SOLUTION: As per pwinz's suggestion below, I was trying to extract a text value ("/text()") from the attribute, when simply selecting the attribute with @ already gives you the value. My final, working solution is:

def parseTwitter(self, response):
    company_name=response.meta['company_name']
    print('### ### ### Inside PARSE TWITTER ### ### ###')

    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]")

    yield l.load_item()

See related: https://stackoverflow.com/questions/8550114 – Granitosaurus Oct 18 '18 at 11:46

score 2 · Answer 1 · answered Oct 18 '18 at 13:28

That is because the data is manipulated with JavaScript, but Scrapy only downloads the HTML and does not execute any JS/AJAX code.

When scraping with Scrapy, always disable JavaScript in your browser first and then look for what you want to scrape. If it is available, just use your selector/XPath; otherwise, inspect the JS/AJAX calls on the web page to understand how it loads the data.
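One way to run that check without a browser is to search the raw HTML for the attribute you are after. The following is only a minimal sketch using Python's standard library; the class `AttributeCollector` and the helper `attribute_values_in_raw_html` are hypothetical names introduced here, and the sample markup is the span quoted in the question rather than a live download.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collect every value of one attribute from raw HTML (no JavaScript is run)."""

    def __init__(self, attribute_name):
        super().__init__()
        self.attribute_name = attribute_name
        self.values = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the tag being opened.
        for name, value in attrs:
            if name == self.attribute_name:
                self.values.append(value)


def attribute_values_in_raw_html(html, attribute_name):
    """Hypothetical helper: return all values of attribute_name found in html."""
    collector = AttributeCollector(attribute_name)
    collector.feed(html)
    collector.close()
    return collector.values


# The span from the question, standing in for a downloaded response body.
raw_html = (
    '<span class="ProfileNav-value" data-count="57939946" '
    'data-is-compact="true">57.9M</span>'
)
found = attribute_values_in_raw_html(raw_html, "data-count")
print(found)
```

If the list comes back non-empty for the body Scrapy actually received, the value is present in the static HTML and a plain selector should be able to reach it.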

So, to scrape the number of followers, you can use the following CSS selector:

.ProfileNav-item.ProfileNav-item--followers a

Scrapy code

item = {}
item["followers"] = response.css(".ProfileNav-item.ProfileNav-item--followers a").extract_first()
yield item

score 1 · Accepted Answer · answered Oct 18 '18 at 13:52

With respect to other answers, dynamic content is not the issue here. You are trying to get the text() from the data-count attribute. You should just be able to get the data from the @data-count.

Try this pattern:

l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]")

It worked for me.

pwinz

Thanks pwinz - you are correct - I will add my actual solution to the question now for future travellers. – Brian Murray Oct 18 '18 at 13:55
Glad to help @BrianMurray – pwinz Oct 21 '18 at 21:06
