Simple Python social media scrape of Public information

Question

I just want to grab public information from my accounts on two social media sites (Instagram and Twitter). My code returns info for Twitter, and I know the XPath is correct for Instagram, but for some reason I'm not getting any data for it. I know the XPaths could be more specific, but I can fix that later. Both of my accounts are public.

1) I thought maybe it didn't like the Python User-Agent header, so I tried changing it, and I still get nothing. That line is commented out, but it's still there.

2) I heard something about an API on GitHub, but that lengthy code is very intimidating and way above my level of understanding. I don't understand more than half of what I'm reading there.

from lxml import html
import requests
import webbrowser

headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36'}

#page = requests.get('https://www.instagram.com/<my account>/', headers=headers)
page = requests.get('https://www.instagram.com/<my account>/')
tree = html.fromstring(page.text)
pageTwo = requests.get('http://www.twitter.com/<my account>')
treeTwo = html.fromstring(pageTwo.text)

instaFollowers = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()")

instaFollowing = tree.xpath("//span[@data-reactid='.0.1.0.0:0.1.3.2.0']/span[2]/text()")

twitFollowers = treeTwo.xpath("//a[@data-nav='followers']/span[@class='ProfileNav-value']/text()")

twitFollowing = treeTwo.xpath("//a[@data-nav='following']/span[@class='ProfileNav-value']/text()")

print ''
print '--------------------'
print 'Social Media Checker'
print '--------------------'
print ''
print 'Instagram: ' + str(instaFollowers) + ' / ' + str(instaFollowing)
print ''
print 'Twitter: ' + str(twitFollowers) + ' / ' + str(twitFollowing)

Just checked the page sources of random Twitter and IG pages. While I can find Twitter's `@data-nav` attribute, I cannot find IG's `@data-reactid`. By the way, IG's followers and following counts are output in a JSON string inside a JavaScript `<script>` tag. – Parfait, Nov 15 '15 at 02:07
Running the same XPath in the Google Chrome or Firefox console returned the result, which is how I know it works. This is what I'm calling on Instagram: $x("//span[@data-reactid='.0.1.0.0:0.1.3.1.0']/span[2]/text()"), which returns '21.2k'. – J. Smith, Nov 15 '15 at 05:57
See this [SO post](http://stackoverflow.com/questions/6364138/how-to-get-fully-computed-html-instead-of-source-html). The web developer tools in Chrome and FF may show the fully generated HTML, not the source HTML sent from the server, which is what Python's `requests.get()` receives. Those span elements may be generated dynamically by JavaScript functions and then rendered in the browser. Might have to send [post](http://stackoverflow.com/questions/2018026/should-i-use-urllib-urllib2-or-requests) params? – Parfait, Nov 15 '15 at 15:26

score 1 · Accepted Answer · answered Nov 29 '15 at 17:55

As mentioned, Instagram's page source does not reflect its rendered source, because a JavaScript function is called to pass content from JSON data to the browser. Hence, what Python scrapes from the page source is not exactly what the browser renders to the screen. Welcome to the new world of dynamic web programming! Consider using Instagram's API or another web parser that can retrieve generated HTML content (not just the page source).
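
One way to check this mismatch yourself is to test whether the attribute you are querying appears at all in the HTML the server sends. The sketch below uses only the standard library and a small hand-written stand-in for the served source; the `served_html` string and the `AttrCollector` class are assumptions made for illustration, not Instagram's real markup or any library's API.

```python
from html.parser import HTMLParser

# Hypothetical stand-in for what the server sends: the counts live in a
# <script> block, and the <span> the browser later shows does not exist yet.
served_html = """
<html><body>
<div id="react-root"></div>
<script type="text/javascript">window._sharedData = {"followed_by": {"count": 977186}};</script>
</body></html>
"""


class AttrCollector(HTMLParser):
    """Record every attribute name seen and the text inside <script> tags."""

    def __init__(self):
        super().__init__()
        self.attr_names = set()
        self.script_text = []
        self._in_script = False

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the opening tag
        self.attr_names.update(name for name, _ in attrs)
        if tag == "script":
            self._in_script = True

    def handle_endtag(self, tag):
        if tag == "script":
            self._in_script = False

    def handle_data(self, data):
        # Keep only text that appears between <script> and </script>
        if self._in_script:
            self.script_text.append(data)


collector = AttrCollector()
collector.feed(served_html)

has_reactid = "data-reactid" in collector.attr_names
has_shared_data = any("window._sharedData" in s for s in collector.script_text)
print("data-reactid in served source:", has_reactid)
print("window._sharedData in a script:", has_shared_data)
```

If the attribute is missing from the served source while the data sits in a script block, an XPath that works in the browser console will return an empty list in Python.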

With that said, if you simply need the IG account data, you can still use Python's lxml to XPath the JSON content in the <script> tag (specifically the sixth occurrence, but adjust to your needed page). The example below parses Google's Instagram JSON data:

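import lxml.etree as et
import urllib.request as rq

# Fetch the raw page source (what the server sends, before any JavaScript runs)
rqpage = rq.urlopen('https://instagram.com/google')
txtpage = rqpage.read()

# Parse the HTML, then select the text of the text/javascript <script>
# element at position 6 among its sibling <script> elements
tree = et.HTML(txtpage)
jsondata = tree.xpath("//script[@type='text/javascript' and position()=6]/text()")

# Print the script content, which holds the window._sharedData JSON
for i in jsondata:
    print(i)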

OUTPUT

window._sharedData = {"qs":"{\"shift\":10,\"header
\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob
\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-
rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-
6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDX
zj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}","static_root":"
\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff","entry_data":
{"ProfilePage":[{"__query_string":"?","__path":"\/google\/","__get_params":
{},"user":{"username":"google","has_blocked_viewer":false,"follows":
{"count":10},"requested_by_viewer":false,"followed_by":
{"count":977186},"country_block":null,"has_requested_viewer":false,"followed_
by_viewer":false,"follows_viewer":false,"profile_pic_url":"https:
\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150
\/11910217_933356470069152_115044571_a.jpg","is_private":false,"full_name":
"Google","media":{"count":180,"page_info":
{"has_previous_page":false,"start_cursor":"1126896719808871555","end_cursor":
"1092117490206686720","has_next_page":true},"nodes":[{"code":"-
jipiawryD","dimensions":{"width":640,"height":640},"owner":
{"id":"1067259270"},"comments":{"count":105},"caption":"Today's the day! 
Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 
#GoogleTrends","likes":
{"count":11410},"date":1448556579.0,"thumbnail_src":"https:\/
\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\
/11848856_482502108621097_589421586_n.jpg","is_video":true,"id":"112689671980
8871555","display_src":"https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-
xat1\/t51.2885-15
...

JSON Pretty Print (extracting the window._sharedData variable from above)

Note that the user data (followers, following, etc.) appears near the beginning:

{
  "qs": "{\"shift\":10,\"header\":\"n3bTdmHGHDgxvZYPN0KDFHqbkxd6zpTl\",\"edges\":100,\"blob\":\"AQCq42rOTCnKOZcOxFn06L1J6_W8wY6ntAS1bX88VBClAjQD9PyJdefCzOwfSAbUdsBwHKb1QSndurPtjyN-rHMOrZ_6ubE_Xpu908cyron9Zczkj4QMkAYUHIgnmmftuXG8rrFzq_Oq3BoXpQgovI9hefha-6SAs1RLJMwMArrbMlFMLAwyd1TZhArcxQkk9bgRGT4MZK4Tk2VNt1YOKDN1pO3NJneFlUxdUJTdDXzj3eY-stT7DnxF_GM_j6xwk1o\",\"iterations\":7,\"size\":42}",
  "static_root": "\/\/instagramstatic-a.akamaihd.net\/bluebar\/5829dff",
  "entry_data": {
    "ProfilePage": [
      {
        "__query_string": "?",
        "__path": "\/google\/",
        "__get_params": {

        },
        "user": {
          "username": "google",
          "has_blocked_viewer": false,
          "follows": {
            "count": 10
          },
          "requested_by_viewer": false,
          "followed_by": {
            "count": 977186
          },
          "country_block": null,
          "has_requested_viewer": false,
          "followed_by_viewer": false,
          "follows_viewer": false,
          "profile_pic_url": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xfp1\/t51.2885-19\/s150x150\/11910217_933356470069152_115044571_a.jpg",
          "is_private": false,
          "full_name": "Google",
          "media": {
            "count": 180,
            "page_info": {
              "has_previous_page": false,
              "start_cursor": "1126896719808871555",
              "end_cursor": "1092117490206686720",
              "has_next_page": true
            },
            "nodes": [
              {
                "code": "-jipiawryD",
                "dimensions": {
                  "width": 640,
                  "height": 640
                },
                "owner": {
                  "id": "1067259270"
                },
                "comments": {
                  "count": 105
                },
                "caption": "Today's the day! Your searches are served. Happy Thanksgiving \ud83c\udf57\ud83c\udf70 #GoogleTrends",
                "likes": {
                  "count": 11410
                },
                "date": 1448556579,
                "thumbnail_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg",
                "is_video": true,
                "id": "1126896719808871555",
                "display_src": "https:\/\/instagram.ford1-1.fna.fbcdn.net\/hphotos-xat1\/t51.2885-15\/e15\/11848856_482502108621097_589421586_n.jpg"
              },
              {
                "code": "-hwbf2wr0O",
                "dimensions": {
                  "width": 640,
                  "height": 640
                },
                "owner": {
                  "id": "1067259270"
                },
                "comments": {
                  "count": 95
                },
                "caption": "Thanksgiving dinner is waiting. But first, the airport. \u2708\ufe0f #GoogleApp",
                "likes": {
                  "count": 12621
                },
...
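
To go from the script text to the actual counts, the JSON can be cut out of the `window._sharedData = ...` assignment and loaded with the standard `json` module. This is a minimal sketch, assuming the script text has the shape shown in the output above and ends with an optional semicolon; `script_text` is a shortened, hand-written stand-in and `extract_shared_data` is a hypothetical helper name.

```python
import json
import re

# Hypothetical, shortened stand-in for the <script> text printed above.
script_text = (
    'window._sharedData = {"entry_data": {"ProfilePage": [{"user": {'
    '"username": "google", "follows": {"count": 10}, '
    '"followed_by": {"count": 977186}}}]}};'
)


def extract_shared_data(text):
    """Return the object assigned to window._sharedData as a dict."""
    # Capture everything from the first "{" to the last "}" of the assignment
    match = re.search(
        r"window\._sharedData\s*=\s*(\{.*\})\s*;?\s*$", text, re.DOTALL
    )
    if match is None:
        raise ValueError("window._sharedData assignment not found")
    return json.loads(match.group(1))


shared = extract_shared_data(script_text)

# Key names follow the pretty-printed JSON shown above
user = shared["entry_data"]["ProfilePage"][0]["user"]
followers = user["followed_by"]["count"]
following = user["follows"]["count"]
print("Instagram:", followers, "/", following)
```

With the real page, `script_text` would be an element of the `jsondata` list from the lxml example. The key names may change whenever the site changes its page layout, so treat them as a snapshot of the structure shown here.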

score 0 · Answer 2 · answered Nov 28 '15 at 15:38

If anyone is still interested in this sort of thing, using Selenium solved my problem.

http://pastebin.com/5eHeDt3r

Is there a faster way?

J. Smith

Yes, there is. Use the automated system described in my answer. – Andrew Polukhin Dec 21 '20 at 20:03

score 0 · Answer 3 · answered Dec 21 '20 at 20:02

In case you want to find information about yourself and others without the hassle of writing code, try this piece of software. Apart from automatic scraping, it analyzes and visualizes the collected information in a PDF report, covering Facebook, Twitter, Instagram, and the Google Search engine.

P.S. I am the main developer and maintainer of this project.
