
I'm trying to get all the links connected to each image in this webpage.

I can get all the links if I let a selenium script scroll downward until it reaches the bottom. One such link that I wish to scrape is this one.

Now, my goal here is to parse all those links using requests. What I have noticed is that the links I want to parse are built from shortcodes such as B-uPwZsJtnB.

However, I'm trying to scrape the different shortcodes available in a script tag found in the page source of that webpage. There are around 600 shortcodes on that page, but the script I've created can parse only the first 70 of them, which ultimately build only 70 qualified links.

How can I grab all 600 links using requests?

I've tried so far with:

import re
import json
import requests

base_link = 'https://www.instagram.com/p/{}/'
lead_url = 'https://www.instagram.com/explore/tags/baltimorepizza/'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36'
    req = s.get(lead_url)
    # Pull the JSON assigned to window._sharedData out of the page source
    script_tag = re.findall(r"window\._sharedData[^{]+(.*?);",req.text)[0]
    # Walk the hashtag page data down to the media edges and build the post links
    for item in json.loads(script_tag)['entry_data']['TagPage']:
        tag_items = item['graphql']['hashtag']['edge_hashtag_to_media']['edges']
        for elem in tag_items:
            profile_link = base_link.format(elem['node']['shortcode'])
            print(profile_link)

3 Answers


If you want to do it with requests, then consider querying the XHR/AJAX HTTP requests to imitate the lazy load. See the following picture:

[screenshot: the XHR requests shown in the browser's developer tools Network tab]

You make queries to the instagram.com server similar to those described in the post Scrape a JS Lazy load page by Python requests.
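A minimal sketch of that idea with requests, assuming Instagram's (historical) GraphQL pagination endpoint is still reachable; QUERY_HASH is a placeholder you would have to pull from the page's JavaScript bundles, the response shape mirrors the window._sharedData structure from the question, and the endpoint may additionally require the session cookies mentioned in the disclaimer below:

import json
import requests

base_link = 'https://www.instagram.com/p/{}/'
api_url = 'https://www.instagram.com/graphql/query/'
QUERY_HASH = '<query_hash taken from the page JS>'  # placeholder, not a working value

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0'
    has_next, end_cursor = True, None
    while has_next:
        variables = {'tag_name': 'baltimorepizza', 'first': 50}
        if end_cursor:
            variables['after'] = end_cursor
        res = s.get(api_url, params={'query_hash': QUERY_HASH,
                                     'variables': json.dumps(variables)})
        media = res.json()['data']['hashtag']['edge_hashtag_to_media']
        for edge in media['edges']:
            print(base_link.format(edge['node']['shortcode']))
        # page_info carries the cursor needed to request the next batch
        page_info = media['page_info']
        has_next, end_cursor = page_info['has_next_page'], page_info['end_cursor']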

Disclaimer

You might not succeed in completing this task due to dynamic cookie values or other scraping prevention imposed by Instagram.

Igor Savinkin

The Instagram web page uses lazy loading to load the images. You can overcome this in 2 ways:

  1. Use the Instagram API as mentioned in the comments
  2. Use a tool like selenium to load all the images on the page by scrolling to the bottom and then fetch the links

The first method is the better way to do it.
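For completeness, a minimal sketch of the second approach, assuming selenium 3 with a matching chromedriver on your PATH; the scroll loop simply repeats until the page height stops growing:

import time
from selenium import webdriver

url = 'https://www.instagram.com/explore/tags/baltimorepizza/'

driver = webdriver.Chrome()
driver.get(url)

# Keep scrolling until the document height stops increasing
last_height = driver.execute_script("return document.body.scrollHeight")
while True:
    driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
    time.sleep(2)
    new_height = driver.execute_script("return document.body.scrollHeight")
    if new_height == last_height:
        break
    last_height = new_height

# Collect every anchor that points at an individual post
links = {a.get_attribute('href') for a in
         driver.find_elements_by_css_selector("a[href*='/p/']")}
print(len(links))
driver.quit()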

BBloggsbott
  • You got my question wrong @BBloggsbott. I wasn't looking for any better way of doing this; rather, I wished to accomplish the rest the way I started. I already got your suggestion concerning the API in a comment before placing the bounty on the question. As for the selenium route, I've mentioned in my post that I already went that way successfully. Thanks. – robots.txt May 31 '20 at 09:39

I suggest you use the Instagram Graph API if you are building a commercial product, since using Instagram public data requires consent because of GDPR. This API will ease your work, but it comes with limitations, such as a quota of 30 hashtag searches per user token within a 7-day period.
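For reference, a minimal sketch of that route with requests, assuming you already have an Instagram Business/Creator account ID and a valid access token (both placeholders below); the two calls follow the documented hashtag search flow (ig_hashtag_search, then the hashtag's recent_media edge):

import requests

ACCESS_TOKEN = '<access token>'                 # placeholder
IG_USER_ID = '<instagram business account id>'  # placeholder

# Step 1: resolve the hashtag name to a hashtag ID
res = requests.get('https://graph.facebook.com/v7.0/ig_hashtag_search',
                   params={'user_id': IG_USER_ID, 'q': 'baltimorepizza',
                           'access_token': ACCESS_TOKEN})
hashtag_id = res.json()['data'][0]['id']

# Step 2: fetch recently tagged media and print their permalinks
res = requests.get('https://graph.facebook.com/v7.0/{}/recent_media'.format(hashtag_id),
                   params={'user_id': IG_USER_ID, 'fields': 'id,permalink',
                           'access_token': ACCESS_TOKEN})
for item in res.json().get('data', []):
    print(item['permalink'])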

If you are building a non-commercial tool, you have two approaches.

  1. Scrape the Instagram web page directly. As mentioned in the answers above, you can use selenium and automate the page interactions, since the web page uses JavaScript to generate the image URLs. The disadvantage of this method is that Instagram and Facebook apply anti-scraping measures, such as wrapping HTML elements in dynamically generated classes and changing XPaths frequently. You might have to spend a lot of time coding and then fixing those things later.

  2. Use third-party libraries that are built to scrape Instagram data. There are many open-source third-party libraries on GitHub, and instaloader is my favorite; you can download all of a hashtag's search results with a single command (see the sketch after this list). This library downloads not only the images but also the JSON data of the post related to each image. Since the library has maintainers, you don't have to worry about later Instagram web page changes. I recommend this method in your case.
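A minimal sketch of the instaloader route, assuming instaloader 4.x (pip install instaloader); get_hashtag_posts yields post objects whose shortcode builds the same /p/<shortcode>/ links as in the question:

import instaloader

L = instaloader.Instaloader(download_pictures=False, download_videos=False,
                            download_video_thumbnails=False, save_metadata=False)

# Iterate over posts tagged #baltimorepizza and print their post links
for post in L.get_hashtag_posts('baltimorepizza'):
    print('https://www.instagram.com/p/{}/'.format(post.shortcode))

The single-command equivalent from the answer would be running instaloader "#baltimorepizza" on the command line, which by default also stores each post's JSON metadata alongside the downloaded image.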

Sajith Herath