I'm writing a script to find out which full URLs a large number of shortened URLs lead to. I'm using the requests module to follow redirects and get the URL one would end up at when entering the short URL in a browser. This works for almost all link shorteners, but fails for URLs from disq.us for reasons I can't figure out: for disq.us URLs, requests returns the same URL I passed in, whereas a browser gets redirected.

Below is a snippet which correctly resolves a bit.ly-shortened link but fails with a disq.us-link. I run it with Python 3.6.4 and version 2.18.4 of the requests module. SO will not allow me to include shortened URLs in the question, so I'll leave those in a comment.

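import requests

# Browser-like user agent; some shorteners only redirect when they see one.
user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'

url1 = "SOME BITLY URL"
url2 = "SOME DISQ.US URL"

for url in [url1, url2]:
    # Fresh session per URL so no cookies carry over between requests.
    s = requests.Session()
    s.headers['User-Agent'] = user_agent
    # Follow HTTP redirects and print the URL of the final response.
    r = s.get(url, allow_redirects=True, timeout=10)
    print(r.url)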
bjarkemoensted

1 Answer

Your first URL is a 404 for me. Interestingly, when I tried the second URL it worked, but I had used a different user agent. When I tried it again with your user agent, it did not redirect.

This suggests that the web server responds differently to that user agent string, and that the problem isn't with requests.

>>> import requests
>>> user_agent = 'foo'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'https://www.elsevier.com/connect/could-dissolvable-microneedles-replace-injected-vaccines'

vs.

>>> import requests
>>> user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_9_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/34.0.1847.131 Safari/537.36'
>>> url = 'THE_DISCUS_URL'
>>> s = requests.Session()
>>> s.headers['User-Agent'] = user_agent
>>> r = s.get(url, allow_redirects=True, timeout=10)
>>> r.url
'THE_DISCUS_URL'
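
Given the difference between the two sessions, one option is to retry with another user agent whenever the URL comes back unchanged. The following is only a sketch: `resolve_with_fallback` and `fetch_with_requests` are hypothetical helper names, and treating "final URL equals input URL" as "not redirected" is an assumption that may not hold for every shortener.

```python
def resolve_with_fallback(url, user_agents, fetch):
    """Return the first final URL that differs from `url`, else `url` itself.

    `fetch(url, user_agent)` must return the final URL after redirects.
    """
    for ua in user_agents:
        final = fetch(url, ua)
        if final != url:
            # This user agent got us redirected somewhere else.
            return final
    # No user agent produced a redirect; report the input unchanged.
    return url


def fetch_with_requests(url, user_agent):
    """Follow HTTP redirects with requests and return the final URL."""
    # Imported here so resolve_with_fallback can be used without requests.
    import requests

    with requests.Session() as s:
        s.headers["User-Agent"] = user_agent
        return s.get(url, allow_redirects=True, timeout=10).url
```

It would be called as `resolve_with_fallback(url, [browser_agent, 'foo'], fetch_with_requests)`, trying the browser-like agent first because other shorteners may need it.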

I got curious, so I investigated a little more. The actual content of the response is a noscript tag containing the link, plus some JavaScript that performs the redirect.

What's probably going on here is that if disq.us sees a real web browser user agent, it tries to redirect via JavaScript (and probably does a bunch of tracking in the process). On the other hand, if the user agent isn't familiar, the site assumes the visitor is a script, which probably can't run JavaScript, and just sends a plain HTTP redirect. Since requests does not execute JavaScript, it stops at the first page.
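
If the browser-like user agent has to stay, the link could in principle be read out of the noscript tag instead. Below is a minimal sketch using only the standard library; it assumes the target is an `<a href="...">` inside `<noscript>`, which I have not verified against the real disq.us markup, and `NoscriptLinkParser` and `noscript_target` are made-up names.

```python
from html.parser import HTMLParser


class NoscriptLinkParser(HTMLParser):
    """Collect href values of <a> tags that appear inside <noscript>."""

    def __init__(self):
        super().__init__()
        self._depth = 0   # number of <noscript> elements currently open
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "noscript":
            self._depth += 1
        elif tag == "a" and self._depth > 0:
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag == "noscript" and self._depth > 0:
            self._depth -= 1


def noscript_target(html):
    """Return the first link found inside <noscript>, or None."""
    parser = NoscriptLinkParser()
    parser.feed(html)
    return parser.links[0] if parser.links else None
```

With the session from the question, this would be applied as `noscript_target(r.text)` whenever `r.url` is still the URL that was requested.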

Julian
  • Could you add the user agent you're using? – bjarkemoensted May 01 '18 at 04:08
  • Done. Apparently you can't post disq.us URLs here directly? I was wondering why you used the pastebin! – Julian May 01 '18 at 04:14
  • No, my question got rejected because shortened URLs aren't allowed. – bjarkemoensted May 01 '18 at 04:19
  • I updated with an explanation. I can't be 100% confident it's right without more investigating than I feel like doing, but I'm reasonably sure it's the right idea. – Julian May 01 '18 at 04:19
  • Riiight, it's because requests doesn't run the JavaScript! Of course! The annoying thing is that the header is necessary for some of the other redirects to work. Found another question on this, so marking this as a duplicate. https://stackoverflow.com/questions/41352373/how-to-python-requests-to-follow-url-like-my-browser – bjarkemoensted May 01 '18 at 04:57