30

Months ago, Instagram began rendering their public API inoperable by removing most features and refusing to accept new applications for most permission scopes. Further changes made this week constrict developer options even more.

Many of us have turned to Instagram's private web API to implement the functionality we previously had. One standout, ping/instagram_private_api, manages to rebuild most of the prior functionality. However, alongside the publicly announced changes this week, Instagram also made underlying changes to their private API, requiring magic variables, user agents, and MD5 hashing to make web scraping requests possible. This can be seen by following the recent releases on the previously linked git repository, and the exact changes needed to continue fetching data can be seen here.

These changes include:

  • Persisting the User Agent & CSRF token between requests.
  • Making an initial request to https://instagram.com/ to grab an rhx_gis magic key from the response body.
  • Setting the X-Instagram-GIS header, which is formed by magically concatenating the rhx_gis key and query variables before passing them through an MD5 hash.

Anything less than this will result in a 403 error. These changes have been implemented successfully in the above repository, but my attempt in JS continues to fail. In the code below, I am attempting to fetch the first 9 posts from a user's timeline. The query parameters which determine this are:

  • query_hash of 42323d64886122307be10013ad2dcc44 (fetch media from the user's timeline).
  • variables.id of any user ID as a string (the user to fetch media from).
  • variables.first, the number of posts to fetch, as an integer.

Previously, this request could be made without any of the above changes by simply GETting from https://www.instagram.com/graphql/query/?query_hash=42323d64886122307be10013ad2dcc44&variables=%7B%22id%22%3A%225380311726%22%2C%22first%22%3A1%7D, as the URL was unprotected.
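For reference, the unprotected URL above is just the two query parameters URL-encoded. A minimal sketch of how it can be assembled, assuming Node's built-in URLSearchParams (buildQueryUrl is a hypothetical helper name, not part of any library):

```javascript
// Sketch only: build the GraphQL query URL from a query hash and a variables object.
// `buildQueryUrl` is a hypothetical helper introduced for illustration.
function buildQueryUrl(queryHash, variables) {
    const params = new URLSearchParams({
        query_hash: queryHash,
        // The variables are sent as a JSON string; key order follows insertion order.
        variables: JSON.stringify(variables)
    });
    return `https://www.instagram.com/graphql/query/?${params.toString()}`;
}

const exampleUrl = buildQueryUrl('42323d64886122307be10013ad2dcc44', {
    id: '5380311726',
    first: 1
});
console.log(exampleUrl);
```

Note that the JSON string produced here would also have to be the exact string fed into the signature, since any difference in key order or whitespace changes the MD5 hash.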

However, my attempt at replicating the functionality successfully implemented in the above repository is not working, and I only receive 403 responses from Instagram. I'm using superagent as my requests library, in a Node environment.

const crypto = require('crypto');
const superagent = require('superagent');

/*
** Retrieve an arbitrary cookie value by a given key.
*/
const getCookieValueFromKey = function(key, cookies) {
    const cookie = cookies.find(c => c.indexOf(key) !== -1);
    if (!cookie) {
        throw new Error('No key found.');
    }
    return (RegExp(key + '=(.*?);', 'g').exec(cookie))[1];
};

/*
** Calculate the value of the X-Instagram-GIS header by md5 hashing together the rhx_gis variable and the query variables for the request.
*/
const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};

/*
** Begin
*/
const userAgent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_1) AppleWebKit/604.3.5 (KHTML, like Gecko) Version/11.0.1 Safari/604.3.5';

// Make an initial request to get the rhx_gis string
const initResponse = await superagent.get('https://www.instagram.com/');
const rhxGis = (RegExp('"rhx_gis":"([a-f0-9]{32})"', 'g')).exec(initResponse.text)[1];

const csrfTokenCookie = getCookieValueFromKey('csrftoken', initResponse.header['set-cookie']);

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9
});

const signature = generateRequestSignature(rhxGis, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'X-Instagram-GIS': signature,
        'Cookie': `rur=FRC;csrftoken=${csrfTokenCookie};ig_pr=1`
    });

What else should I try? What makes my code fail, and the provided code in the repository above work just fine?

Update (2018-04-17)

For at least the 3rd time in a week, Instagram has again updated their API. The change no longer requires the CSRF Token to form part of the hashed signature.

The question above has been updated to reflect this.

Update (2018-04-14)

Instagram has again updated their private graphql API. As far as anyone can figure out:

  • The User Agent no longer needs to be included in the X-Instagram-GIS MD5 calculation.

The question above has been updated to reflect this.

Eagl3
marked-down
  • Have you tried adding `x-requested-with` headers https://github.com/ping/instagram_private_api/blob/54427574583d33544c006c9f6a13cb6bc306a714/instagram_web_api/client.py#L226-L234 and changing the user agent to a normal browser's? – inDream Apr 12 '18 at 05:19
  • @inDream, yes, but it's irrelevant because those headers are never actually added for the purposes of this question (`params` is `None`). Also, the UA has been updated for the sake of the question to match the Python lib, but it is also irrelevant provided it is kept consistent between requests. – marked-down Apr 12 '18 at 05:25
  • @ReactingToAngularVues I'm also fighting with these changes now. I have a Chrome extension that used to save media from Instagram, and so I use pure JavaScript. I guess I'm stuck for good though, since it seems to be impossible to access the 'set-cookie' value. – tube-builder Apr 12 '18 at 08:15
  • 1
    Has anyone figured out at what point they start throttling and throwing 429 responses? – Jamie Chong Apr 14 '18 at 22:47
  • 3
    Hello all, I am also struggling with the Instagram updates. I was getting the profile details and first 12 media from this link: https://www.instagram.com/username/?__a=1 . But due to Instagram's new header changes, it's giving a 403 Forbidden response. I saw they've added X-Instagram-GIS as discussed above, but couldn't work out what the variables should be for creating the magic string, as there are no variables for this link. Should we take the username or id as a variable? I've got the rhx_gis and csrf_token. – saurabh Apr 17 '18 at 12:00
  • @saurabh see the user_info2 method in the Python repository above for details on how to fetch that. It still works if you have a fresh rhx_gis. – marked-down Apr 21 '18 at 03:56
  • rhx_gis + X-Instagram-Gis is no longer in use. instagram has changed it to 'x-ig-www-claim' instead. – Aero Wang Aug 29 '19 at 05:23

4 Answers

17

Values to persist

You aren't persisting the User Agent (a requirement) in the first query to Instagram:

const initResponse = await superagent.get('https://www.instagram.com/');

Should be:

const initResponse = await superagent.get('https://www.instagram.com/')
                     .set('User-Agent', userAgent);

This must be persisted in each request, along with the csrftoken cookie.
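As a rough illustration of what "persisting" means here, the sketch below pulls the csrftoken out of the set-cookie headers of the first response and builds one header object to reuse on every later request. The helper names (extractCookieValue, buildSessionHeaders) and the sample cookie strings are assumptions made for this example, not part of superagent or of Instagram's API:

```javascript
// Sketch only: both helpers are hypothetical names introduced for illustration.

// Find a cookie value by exact name in an array of Set-Cookie header strings.
function extractCookieValue(setCookieHeaders, key) {
    for (const header of setCookieHeaders) {
        // The cookie itself is the first "name=value" pair; the rest are attributes.
        const pair = header.split(';')[0];
        const idx = pair.indexOf('=');
        if (idx !== -1 && pair.slice(0, idx).trim() === key) {
            return pair.slice(idx + 1).trim();
        }
    }
    throw new Error(`Cookie "${key}" not found.`);
}

// Build the headers that must stay identical across the initial and follow-up requests.
function buildSessionHeaders(userAgent, csrfToken) {
    return {
        'User-Agent': userAgent,
        'Cookie': `csrftoken=${csrfToken}`
    };
}

// Made-up sample data standing in for initResponse.header['set-cookie'].
const sampleSetCookie = [
    'rur=FRC; Path=/',
    'csrftoken=abc123; Max-Age=31449600; Path=/; Secure'
];
const sessionHeaders = buildSessionHeaders(
    'Mozilla/5.0 (example user agent)',
    extractCookieValue(sampleSetCookie, 'csrftoken')
);
console.log(sessionHeaders);
```

With superagent, the same object could then be passed to .set(...) on both the initial request and the GraphQL request, with the X-Instagram-GIS header added only on the latter.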

X-Instagram-GIS header generation

As your code shows, you must generate the X-Instagram-GIS header from two properties: the rhx_gis value found in the response to your initial request, and the query variables of your next request. These must be MD5 hashed together, as in your function above:

const generateRequestSignature = function(rhxGis, queryVariables) {
    return crypto.createHash('md5').update(`${rhxGis}:${queryVariables}`, 'utf8').digest("hex");
};
Alex
  • Instagram has (again) changed their private API since I posted this question, so I've updated my post. However, I'm _still_ having no luck here in getting this to work. – marked-down Apr 14 '18 at 11:33
  • 3
    do you use the same useragent on initial request to `https://www.instagram.com/`? Because I cant see that in your example – Alex Apr 14 '18 at 14:08
  • i'm sorry, but true signature of your data is fed133e749c8bc81a3e6d2e049e8da9d – PirateNinja Apr 16 '18 at 12:37
  • Thank you, but what can i do with request ?__a=1? How make magic string without variables? – PirateNinja Apr 16 '18 at 20:31
  • @Alex ding ding ding, thank you! I am a very silly person for not noticing that. Do you mind if I improve your answer and award the bounty to you? – marked-down Apr 17 '18 at 03:48
  • 1
    @ReactingToAngularVues please do it:) – Alex Apr 17 '18 at 05:57
  • 5
    Actually, they have updated the API again and now you don't even need the csrftoken in the signature, only rhx_gis and variables... @PirateNinja use the relative path instead of variables, for example for `https://www.instagram.com/durov/` it would be `/durov/` (don't forget the slashes) – Alex Apr 17 '18 at 06:08
  • 1
    @PirateNinja yep! – Alex Apr 17 '18 at 06:41
  • tnx @Alex I see this change today – sadeghpro Apr 17 '18 at 12:33
  • @Alex Do you think you could update your post to this? https://pastebin.com/raw/AdzXxeVV I have attempted to edit it, but my edit has been rejected for some ridiculous reason via peer review. – marked-down Apr 18 '18 at 01:43
  • @ReactingToAngularVues ok, except that you dont need `rur` and `ig_pr` cookies – Alex Apr 18 '18 at 07:40
  • @Alex even better, feel free to edit the answer as you see fit :) – marked-down Apr 18 '18 at 07:41
  • I'm able to now scrape the first 12 images and then the next 12 (or 50 which is what I set it to), but when I try access the next 12 I get a 403. Any reason why this would be the case? Is the request supposed to be different? @ReactingToAngularVues – Kyle Krzeski May 05 '18 at 19:30
  • I'm able to now scrape the first 12 images and then the next 12 (or 50 which is what I set it to), but when I try access the next 12 I get a 403. Any reason why this would be the case? Is the request supposed to be different? Note, this is the second request that I have to give the rhx_gis. Does it change? @ReactingToAngularVues – Kyle Krzeski May 05 '18 at 19:50
  • @WilliamHampshire What are you using to access multiple pages? I can access the first 12 just fine, but not sure how to requery for more. Thanks! – curiouscode Apr 14 '19 at 05:22
4

So, in order to call an Instagram query, you need to generate the x-instagram-gis header.

To generate this header, you need to calculate an MD5 hash of the following string: "{rhx_gis}:{path}". The rhx_gis value is stored in the source code of the Instagram page, in the window._sharedData global JS variable.

Example:
If you make a GET request for user info like this: https://www.instagram.com/{username}/?__a=1
you need to add the HTTP header x-instagram-gis to the request, with the value
MD5("{rhx_gis}:/{username}/")

This is tested and works, so feel free to ask if something goes wrong.
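A minimal Node sketch of that calculation, assuming the built-in crypto module. The function names signPath and profilePath are hypothetical, and the rhx_gis value is only a sample (the one quoted in the comments below); a real one has to be read from a fresh page load:

```javascript
const crypto = require('crypto');

// Sketch only: `signPath` and `profilePath` are hypothetical helper names.

// The signed path is the relative profile path, without the ?__a=1 query string.
function profilePath(username) {
    return `/${username}/`;
}

// MD5("{rhx_gis}:{path}") as a lowercase hex string.
function signPath(rhxGis, path) {
    return crypto.createHash('md5').update(`${rhxGis}:${path}`, 'utf8').digest('hex');
}

// Sample rhx_gis only; it must be replaced with one scraped from window._sharedData.
const sampleRhxGis = 'fc2e73d4fd7dddcd31d28bea5cb2df59';
const gisHeader = signPath(sampleRhxGis, profilePath('username'));
console.log(gisHeader); // value to send as the x-instagram-gis header
```

The request itself would still go to https://www.instagram.com/username/?__a=1; only the string being hashed omits the query string.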

olllejik
  • Hey its not working, I tried with md5("fc2e73d4fd7dddcd31d28bea5cb2df59:/username/?__a=1") – Stack Apr 20 '18 at 14:18
  • 2
    @Stack in place of /username/?__a=1 you need to use /username/ so that md5("fc2e73d4fd7dddcd31d28bea5cb2df59:/username/") will give the value 00a89418c3a4f92d5407e36116117cd9 . This value you need to place into the x-instagram-gis header of the GET request "https://www.instagram.com/username/?__a=1" (No need to say that in place of username you need to put your instagram username). Tell me if it works for you, I've just tested the same for my account – olllejik Apr 20 '18 at 20:32
  • 1
    This really seems like a hack. While this may work right now it's probably not a long time solution. – Ares May 01 '18 at 04:39
  • I'm able to now scrape the first 12 images and then the next 12 (or 50 which is what I set it to), but when I try access the next 12 I get a 403. Any reason why this would be the case? Is the request supposed to be different? Are they blocking the third request? @Stack – Kyle Krzeski May 05 '18 at 19:31
  • They have a limit on requests per second, so you need to add request throttling to your code, otherwise you risk being temporarily blocked by Instagram. Try experimenting with the request rate so that it doesn't give a 403 error – olllejik May 07 '18 at 09:34
  • @WilliamHampshire What are you using to access multiple pages? I can access the first 12 just fine, but not sure how to requery for more. Thanks! – curiouscode Apr 13 '19 at 23:19
  • Did this stopped working? – Moradnejad Jun 08 '22 at 06:29
2

Uhm... I don't have Node installed on my machine, so I cannot verify this for sure, but it looks to me like you are missing a crucial part of the parameters in the query string, namely the after field:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 4,
    after: "YOUR_END_CURSOR"
});

Your MD5 hash depends on those queryVariables, so without the after field it doesn't match the expected one. Try that: I expect it to work.

EDIT:

Reading your code carefully, it unfortunately doesn't make much sense. I infer that you are trying to fetch the full stream of pictures from a user's feed.

In that case, you should not call the Instagram home page as you are doing now (superagent.get('https://www.instagram.com/')), but rather the user's stream (superagent.get('https://www.instagram.com/your_user')).

Beware: you need to hardcode the very same user agent you're going to use below (and it doesn't look like you are...).

Then, you need to extract the query ID (it's not hardcoded, it changes every few hours, sometimes minutes; hardcoding it is foolish – however, for this POC, you can keep it hardcoded), and the end_cursor. For the end cursor I'd go for something like this:

const endCursor = (RegExp('end_cursor":"([^"]*)"', 'g')).exec(initResponse.text)[1];

Now you have everything you need to make the second request:

const queryVariables = JSON.stringify({
    id: "123456789",
    first: 9,
    after: endCursor
});

const signature = generateRequestSignature(rhxGis, csrfTokenCookie, queryVariables);

const res = await superagent.get('https://www.instagram.com/graphql/query/')
    .query({
        query_hash: '42323d64886122307be10013ad2dcc44',
        variables: queryVariables
    })
    .set({
        'User-Agent': userAgent,
        'Accept': '*/*',
        'Accept-Language': 'en-US',
        'Accept-Encoding': 'gzip, deflate',
        'Connection': 'close',
        'X-Instagram-GIS': signature,
        'Cookie': `rur=${rurCookie};csrftoken=${csrfTokenCookie};mid=${midCookie};ig_pr=1`
    }).send();
Gian Segato
  • The `after` field is not required, it merely represents a cursor that you can fetch images from. – marked-down Apr 17 '18 at 03:49
  • Given the fact that there's no documentation, yours do not work, while this one does, it looks to me that it's very much required... Point being: there's no other major thing that changes between the two solutions other than 1. the `after` field missing, 2. not using the very same UA for both calls. Change no. 2, if this doesn't fix, no. 1 is your only answer IMHO. – Gian Segato Apr 17 '18 at 07:24
  • as per Alex's answer above, the lacking element was the same UA on the initial call. `after` is a completely optional cursor property. – marked-down Apr 17 '18 at 07:36
  • @Gianluca I'm able to now scrape the first 12 images and then the next 12 (or 50 which is what I set it to), but when I try access the next 12 I get a 403. Any reason why this would be the case? Is the request supposed to be different? Note, this is the second request that I have to give the rhx_gis. Does it change? – Kyle Krzeski May 05 '18 at 19:50
  • @WilliamHampshire Each call from the second onwards outputs a new `ig_gis`, based on the `rhx_gis` and the `new_params`, and the `new_params` themselves. So, the `ig_gis` does change for each call, with the `rhx_gis` that doesn't, while the new params do – Gian Segato May 06 '18 at 08:49
0

query_hash is not constant and keeps changing over time.

For example, the ProfilePage included these scripts:

https://www.instagram.com/static/bundles/base/ConsumerCommons.js/9e645e0f38c3.js https://www.instagram.com/static/bundles/base/Consumer.js/1c9217689868.js

The hash is located in one of the above scripts, e.g. for edge_followed_by:

const res = await fetch(scriptUrl, { credentials: 'include' });
const rawBody = await res.text();
const body = rawBody.slice(0, rawBody.lastIndexOf('edge_followed_by'));
const hashes = body.match(/"\w{32}"/g);
// hashes[hashes.length - 2]; = edge_followed_by
// hashes[hashes.length - 1]; = edge_follow
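To make the string handling above easier to follow without a network call, here is a self-contained sketch of the same idea run against a made-up script body. The helper name findHashesBefore and the sample string are assumptions for illustration; real bundle contents will look different, and whether the hashes actually change over time is disputed in the comment below:

```javascript
// Sketch only: `findHashesBefore` is a hypothetical helper, and `sampleBody`
// is fabricated test input, not real Instagram bundle content.

// Return every quoted 32-character token that appears before the last
// occurrence of `marker` in the script body, in order of appearance.
function findHashesBefore(scriptBody, marker) {
    const end = scriptBody.lastIndexOf(marker);
    if (end === -1) {
        return [];
    }
    const matches = scriptBody.slice(0, end).match(/"\w{32}"/g) || [];
    // Strip the surrounding double quotes from each match.
    return matches.map(m => m.slice(1, -1));
}

const sampleBody = 'a="' + 'a'.repeat(32) + '",b="' + 'b'.repeat(32) + '";edge_followed_by';
const found = findHashesBefore(sampleBody, 'edge_followed_by');
console.log(found[found.length - 2], found[found.length - 1]);
```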
inDream
  • 1
    Nope. `query_hash` is constant on a per method basis. Note that it is hardcoded in the working repository I linked above. https://github.com/ping/instagram_private_api/blob/54427574583d33544c006c9f6a13cb6bc306a714/instagram_web_api/client.py#L387 – marked-down Apr 12 '18 at 04:04