
Long story short, I'm trying to create an Instagram Python scraper that loads the entire page and grabs all the links to the images. I have it working; the only problem is that it only loads the original 12 photos that Instagram shows. Is there any way I can tell requests to load the entire page?

Working code:

import json
import requests
from bs4 import BeautifulSoup
import sys

r = requests.get('https://www.instagram.com/accountName/')
soup = BeautifulSoup(r.text, 'lxml')

script = soup.find('script', text=lambda t: t and t.startswith('window._sharedData'))
page_json = script.text.split(' = ', 1)[1].rstrip(';')
data = json.loads(page_json)
non_bmp_map = dict.fromkeys(range(0x10000, sys.maxunicode + 1), 0xfffd)  # for str.translate() when printing; unused below

for post in data['entry_data']['ProfilePage'][0]['graphql']['user']['edge_owner_to_timeline_media']['edges']:
    image_src = post['node']['display_url']
    print(image_src)
WeDa Beast
    BS4 is the wrong tool for this. Since pages like Instagram have those "infinite scrolling" features, where additional content is shown when a page is scrolled to the bottom, you would need a scraper like selenium which will invoke a browser to load and do the scrolling. Try starting [here](https://stackoverflow.com/questions/21006940/how-to-load-all-entries-in-an-infinite-scroll-at-once-to-parse-the-html-in-pytho) – Scratch'N'Purr Apr 27 '18 at 08:23

4 Answers


As Scratch already mentioned, Instagram uses "infinite scrolling", which won't allow you to load the entire page. But you can check the total number of posts at the top of the page (within the first span with the _fd86t class). Then you can check whether the page already contains all of the posts. Otherwise, you'll have to use a GET request to fetch a new JSON response. The benefit of this is that the request contains the first field, which seems to let you control how many posts you get. You can change it from its standard 12 to fetch all of the remaining posts (hopefully).

The necessary request looks similar to the following (where I've anonymised the actual entries, and with some help from the comments):

https://www.instagram.com/graphql/query/?query_hash=472f257a40c653c64c666ce877d59d2b&variables={"id":"XXX","first":12,"after":"XXX"}
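A minimal sketch of how that response could be consumed (the payload shape under data.user and the query_hash value are assumptions taken from the comments below; Instagram may change both at any time):

```python
import json
from urllib.parse import urlencode

GRAPHQL_URL = 'https://www.instagram.com/graphql/query/'
QUERY_HASH = '472f257a40c653c64c666ce877d59d2b'  # user-posts hash; an Instagram internal that may change

def parse_media_page(payload):
    # Pull the image URLs plus the cursor needed to request the next page.
    media = payload['data']['user']['edge_owner_to_timeline_media']
    urls = [edge['node']['display_url'] for edge in media['edges']]
    info = media['page_info']
    return urls, info['end_cursor'], info['has_next_page']

def next_page_url(user_id, end_cursor, first=12):
    # Build the GET URL for the following batch of posts.
    variables = json.dumps({'id': user_id, 'first': first, 'after': end_cursor})
    return GRAPHQL_URL + '?' + urlencode({'query_hash': QUERY_HASH, 'variables': variables})
```

You would fetch next_page_url(...) with requests.get, feed the decoded JSON back into parse_media_page, and loop until has_next_page is False.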
Martijn
  • Tried using selenium to scroll to the bottom as suggested by scratch. Apparently even when loaded in, Instagram doesn't show the links in the source code. Tried checking out the link you sent for queries, but have no idea how to get the query_hash. – WeDa Beast Apr 27 '18 at 11:29
  • 1
    query_hash depends on the type of request you are doing; to get user posts it is `query_hash=472f257a40c653c64c666ce877d59d2b` – Pablo Gutiérrez Apr 27 '18 at 15:52
  • can you please tell what is `after` variable here? – Chiefir Oct 08 '19 at 08:14

parse_ig.py

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
from InstagramAPI import InstagramAPI
import time

c = webdriver.Chrome()
# load IG page here, whether a hashtag or a public user's page using c.get(url)

for i in range(10):
    c.send_keys(Keys.END)
    time.sleep(1)

soup = BeautifulSoup(c.page_source, 'html.parser')
ids = [a['href'].split('/') for a in soup.find_all('a', href=True) if 'tagged' in a['href']]

Once you have the ids, you can use Instagram's old API to get data for those. I'm not sure if it still works, since FB has slowly been deprecating parts of the old API. Here's the link, in case you don't want to access the Instagram API on your own :)

You can also improve this simple code. For instance, instead of the "for" loop you could use a "while" loop (i.e. while the page is still loading, keep pressing the END key).
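That while-style loop could be sketched like this (a hypothetical helper: it keeps scrolling until document.body.scrollHeight stops growing, which is one common way to detect that no more posts are loading):

```python
import time

def scroll_to_end(driver, pause=1.0, max_rounds=50):
    # Scroll until the page height stops growing (no new posts load),
    # with max_rounds as a safety cap against endless feeds.
    last_height = driver.execute_script('return document.body.scrollHeight')
    for _ in range(max_rounds):
        driver.execute_script('window.scrollTo(0, document.body.scrollHeight);')
        time.sleep(pause)  # give the next batch of posts time to load
        new_height = driver.execute_script('return document.body.scrollHeight')
        if new_height == last_height:
            break  # nothing new appeared; assume we've hit the bottom
        last_height = new_height
```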

zero

@zero's answer is incomplete (at least as of 1/15/19). `c.send_keys` is not a valid method on the webdriver object itself. Instead, this is what I did:

from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from bs4 import BeautifulSoup
import time

c = webdriver.Chrome()
c.get(some_url)

element = c.find_element_by_tag_name('body') # or whatever tag you're looking to scrape from

for i in range(10):
    element.send_keys(Keys.END)
    time.sleep(1)

soup = BeautifulSoup(c.page_source, 'html.parser')

Here is a link to a good tutorial for scraping Instagram profile info and posts that also handles pagination and works in 2022: Scraping Instagram

In summary, you have to use the Instagram GraphQL API endpoint, which requires a user identifier and the cursor from the previous page's response: https://instagram.com/graphql/query/?query_id=17888483320059182&id={user_id}&first=24&after={end_cursor}
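The cursor-following loop could be sketched like this (the fetch_page callable is a stand-in for the actual HTTP request to the endpoint above, and the data.user payload shape is an assumption):

```python
def fetch_all_posts(fetch_page, user_id, first=24):
    # Walk the cursor chain: each response carries the end_cursor for the next request.
    # fetch_page(user_id, first, after) is assumed to return the decoded JSON from
    # https://instagram.com/graphql/query/?query_id=...&id={user_id}&first={first}&after={after}
    urls, after, more = [], '', True
    while more:
        media = fetch_page(user_id, first, after)['data']['user']['edge_owner_to_timeline_media']
        urls += [edge['node']['display_url'] for edge in media['edges']]
        after = media['page_info']['end_cursor']
        more = media['page_info']['has_next_page']
    return urls
```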

kostek