
I need the response from the actual destination URL of a DOI link.

I have tried the solution mentioned in a related SO question.

import requests
doi_link = 'https://doi.org/10.1016/j.artint.2018.07.007'
response = requests.get(url=doi_link, allow_redirects=True)
print(response.status_code, response.url, response.history)
# Outputs: 200 https://linkinghub.elsevier.com/retrieve/pii/S0004370218305988 [<Response [302]>]

Why does allow_redirects stop partway through the redirect chain?

The final URL I get when I open the link manually in a browser is https://www.sciencedirect.com/science/article/pii/S0004370218305988?via%3Dihub

I want to get this URL programmatically.

EDIT: As suggested in the comments, the final redirect to the destination is made using JavaScript.

Tushar Tiwari
  • Does this answer your question? [Python Requests library redirect new url](https://stackoverflow.com/a/20475712/9610015) – Hussain Hassam Feb 01 '22 at 11:04
  • If you open the network tools in your browser, you'll see that the last URL is being redirected to with Javascript. You'll have to either parse the response from your `requests` call to get the new URL to redirect to or use something like Selenium so that the browser will handle the redirect. – D Malan Feb 01 '22 at 12:06
  • @HussainHassam There are no accepted answers there. Which one are you referring to? – Tushar Tiwari Feb 01 '22 at 12:48
  • @DMalan Yes, I checked the network tools in the browser. Could you please point out how to parse the `requests` response for the redirect URL? I do not want to use Selenium or a browser for this. – Tushar Tiwari Feb 01 '22 at 12:50

1 Answer


As suggested here: Python Requests library redirect new url

You can use `response.history` to see the redirects that `requests` followed, and `response.url` to see where it ended up. In this case, that last URL returns a 200, but its HTML contains the "final final" redirect target. You can parse that HTML to get the redirectURL value.

I would use something like beautifulsoup4 to make parsing easy: `pip install beautifulsoup4`

import requests
from bs4 import BeautifulSoup
from urllib.parse import unquote
from html import unescape

doi_link = 'https://doi.org/10.1016/j.artint.2018.07.007'
response = requests.get(url=doi_link, allow_redirects=True)

# print each intermediate redirect that requests followed
for resp in response.history:
    print(resp.status_code, resp.url)

# use final response
# parse html and get final redirect url
soup = BeautifulSoup(response.text, 'html.parser')
redirect_url = soup.find(name="input", attrs={"name": "redirectURL"})["value"]

# get final response. unescape and unquote url from the HTML
final_url = unescape(unquote(redirect_url))
print(final_url)
article_resp = requests.get(final_url)
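
If you would rather not add a dependency, the same extraction can be sketched with the standard library only. This is a minimal sketch, not a definitive implementation: it assumes the intermediate page still carries the target in an `<input name="redirectURL">` element, as in the code above, and the names `RedirectInputParser` and `extract_redirect_url` are made up for illustration. It returns `None` instead of raising when the field is missing, which can happen if the publisher changes the page.

```python
from html.parser import HTMLParser
from urllib.parse import unquote


class RedirectInputParser(HTMLParser):
    """Hypothetical helper: remembers the value of <input name="redirectURL">."""

    def __init__(self):
        super().__init__()
        self.redirect_url = None

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; HTMLParser has already
        # unescaped HTML entities in the attribute values.
        if tag == "input" and self.redirect_url is None:
            attr_map = dict(attrs)
            if attr_map.get("name") == "redirectURL":
                self.redirect_url = attr_map.get("value")


def extract_redirect_url(html_text):
    """Return the percent-decoded redirectURL value, or None if it is absent."""
    parser = RedirectInputParser()
    parser.feed(html_text)
    if parser.redirect_url is None:
        return None
    return unquote(parser.redirect_url)
```

With the `response` object from the snippet above, this would be called as `extract_redirect_url(response.text)`; check for `None` before making the follow-up request.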

Hussain Hassam
  • You don't have to do another request to the final URL (at line `page_text = requests.get(response.url).text`) as far as I know. The `requests` library would already have fetched it for you since you passed in `allow_redirects=True`. – D Malan Feb 02 '22 at 08:14
  • @HussainHassam Thanks. Got the final URL. Not sure why the `article_resp` returned a ``. – Tushar Tiwari Feb 02 '22 at 08:22
  • @DMalan You're absolutely right. I've updated the code to reflect that. Thanks! – Hussain Hassam Feb 02 '22 at 18:00