Looking at Twitter: www.twitter.com/twitter

You will see that the number of followers is shown as 57.9M, but if you hover over that value you will see the exact number of followers.

This appears in the source as:

<span class="ProfileNav-value" data-count="57939946" data-is-compact="true">57.9M</span>

When I inspect this span in Chrome, I use this XPath:

(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]

I am trying to extract just the attribute "data-count" using the above:

def parseTwitter(self, response):
    company_name=response.meta['company_name']
    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]/text()")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]/text()")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]/text()")

...but I'm not getting anything back:

2018-10-18 10:22:07 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-10-18 10:22:07 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/ADP> (referer: None)
2018-10-18 10:22:12 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/Workday> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] DEBUG: Crawled (200) <GET https://twitter.com/OracleHCM> (referer: None)
2018-10-18 10:22:16 [scrapy.core.engine] INFO: Closing spider (finished)
2018-10-18 10:22:16 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'downloader/request_bytes': 892,
 'downloader/request_count': 3,
 'downloader/request_method_count/GET': 3,
 'downloader/response_bytes': 199199,
 'downloader/response_count': 3,
 'downloader/response_status_count/200': 3,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2018, 10, 18, 10, 22, 16, 833691),
 'log_count/DEBUG': 4,
 'log_count/INFO': 7,
 'memusage/max': 52334592,
 'memusage/startup': 52334592,
 'response_received_count': 3,
 'scheduler/dequeued': 3,
 'scheduler/dequeued/memory': 3,
 'scheduler/enqueued': 3,
 'scheduler/enqueued/memory': 3,
 'start_time': datetime.datetime(2018, 10, 18, 10, 22, 7, 269320)}

SOLUTION: As per pwinz's suggestion below, I was trying to extract a text value with "/text()" from the attribute, when simply selecting the attribute with @ already gives you the value. My final, working solution is:

def parseTwitter(self, response):
    company_name=response.meta['company_name']
    print('### ### ### Inside PARSE TWITTER ### ### ###')

    l=ItemLoader(item=TwitterItem(), response=response)
    l.add_value('company_name', company_name)
    l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]")
    l.add_xpath('twitter_following', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[2]")
    l.add_xpath('twitter_followers', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[3]")

    yield l.load_item()

2 Answers

It's because the data is manipulated with JavaScript, and Scrapy only downloads the HTML; it does not execute any JS/AJAX code.

When scraping with Scrapy, always disable JavaScript in your browser first and then look for what you want to scrape. If it is still available, just use your selector/XPath; otherwise, inspect the JS/AJAX calls on the web page to understand how it loads the data.

So, to scrape the number of followers, you can use the following CSS selector:

.ProfileNav-item.ProfileNav-item--followers a

Scrapy code:

item = {}
item["followers"] = response.css(".ProfileNav-item.ProfileNav-item--followers a").extract_first()
yield item

Umair Ayub

With respect to the other answers, dynamic content is not the issue here. You are trying to get text() from the data-count attribute, but an attribute node has no child text nodes, so that step matches nothing. Selecting @data-count on its own already returns the value.

Try this pattern:

l.add_xpath('twitter_tweets', "(//ul[@class='ProfileNav-list']/li/a/span[@class='ProfileNav-value']/@data-count)[1]")

It worked for me.

pwinz