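import time

import requests

def extractlink():
    # Read the URL from the file; strip() drops the trailing newline that read() keeps.
    with open('extractlink.txt', 'r') as g:
        print("opened extractlink.txt for reading")
        contents = g.read().strip()
    headers = {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36'}
    # requests follows redirects by default, so r.url is the final URL after any redirects.
    r = requests.get(contents, headers=headers)
    print("Links to " + r.url)
    time.sleep(2)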

Currently, r.url is just the URL found in 'extractlink.txt'.

I'm looking to fix this script so that it finds the final redirected URL and prints it. The issue appears to lie somewhere in the request for the URL. I have tried many alternatives and troubleshooting steps, but none of the fixes that worked for similar questions has solved my problem.

When debugging, r.history reads as [] and r.status_code reads as 403, even though the link redirects with a 302 in the browser. Any ideas?

(extractlink.txt is just a one-line file containing the link http://butterup.teechip.icu/. Open it at your own risk; it is a spam website.)

Dann
  • `requests.get` should follow redirects by default. Can you include `r.history`? – Cole Dec 17 '18 at 23:49
  • Where should `r.history` be included? I assume you mean just printing to debug? Thanks @Cole – Dann Dec 17 '18 at 23:51
  • print(r.history) – nicholishen Dec 17 '18 at 23:52
  • `r.history` is reading as [] – Dann Dec 17 '18 at 23:53
  • `r.status_code` also reads as 403, the url should have a 302 redirect. – Dann Dec 18 '18 at 00:05
  • Hey, I'm new to stackoverflow, Is there any reason my posts keep getting downvoted? I've tried my best to provide solid information/goals and supply all the code necessary. – Dann Dec 18 '18 at 02:49
  • @Dansey Forget the downvotes; they are, and should be, anonymous. The motivation for an upvote or downvote can be subjective, so do not give it much importance. If you feel this is the best question you can ask, that's okay; otherwise, improve it. Read the [ask] guide and take the [tour] if you have not already done so. – eyllanesc Dec 18 '18 at 02:59
  • @Dansey If you want help in SO you must provide a [mcve]. – eyllanesc Dec 18 '18 at 03:37
  • @eyllanesc I don’t know what more you could want, I’ve provided everything aside from the link (which redirects to a spam/malicious site); I will not bring more attention to such URLs. As I’ve said, the output is a valid url that works in the browser according to print(r.history): this should easily be recreated. – Dann Dec 18 '18 at 04:32
  • @Dansey Well then, provide other URLs that are not malicious :-). Often in these cases there is no universal answer, since it depends on the specific URL, so I recommend you provide the URL with a warning about the danger; otherwise you will probably not receive help. – eyllanesc Dec 18 '18 at 04:45
  • @Dansey For example, the redirection may be done by JavaScript, in which case the requests library would not work and you should use Selenium. That typically happens with links that redirect you after n seconds. – eyllanesc Dec 18 '18 at 04:47
  • `r.url` is the final redirected URL! An empty `r.history` means you were not redirected. If the browser is redirected successfully, then the reason you were not is that the server rejected your request and responded with 403 instead of 302. – KC. Dec 18 '18 at 04:57
  • So my suggestion is to add your example URL, or to try to figure out yourself what your request is lacking. Btw, there are many reasons you might get an unexpected response (Cloudflare, etc.). If you do not provide an example URL, I can only give you suggestions instead of an answer. – KC. Dec 18 '18 at 05:03
  • No problem, @kcorlidy. Here is the url, enter with your own caution. `http://butterup.teechip.icu/` – Dann Dec 18 '18 at 05:18
  • I tested with your code and URL, and the URL responded with 302. I was redirected to `http://newtshirtshop.com/buckle-up-butter-up`. – KC. Dec 18 '18 at 05:27
  • I noticed `contents = g.read()` in your code. Can you print it and check whether it is a valid HTTP URL? Btw, did you get a 403 when you accessed `http://butterup.teechip.icu/`? – KC. Dec 18 '18 at 05:33
  • Yes, for me, the script returns a 403. – Dann Dec 18 '18 at 05:40
  • Oddly enough, printing g.read() ended up blank! Any ideas behind why? @kcorlidy – Dann Dec 18 '18 at 05:46
  • Forgive my rude suggestion. Can you try hardcoding the URL instead of reading it from the file? (I wonder whether the error occurs in requests or in reading the file.) If it occurs when reading the file, read https://stackoverflow.com/questions/16374425/python-read-function-returns-empty-string – KC. Dec 18 '18 at 05:59
  • @kcorlidy, I've researched the problem and came across that link already, I've added the code into my script yet the read still comes out blank. – Dann Dec 18 '18 at 07:03
  • @kcorlidy Hardcoding the url still results in the final url being the same. – Dann Dec 18 '18 at 07:05
  • Is there any solution? – Dann Dec 18 '18 at 22:40
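
The file-reading concern raised in the comments can be checked in isolation. Below is a minimal sketch, not the asker's code: `read_url` is a hypothetical helper that reads the first line of the file, strips surrounding whitespace, and rejects anything that is not an http(s) URL. Note also that calling `read()` a second time on the same open file handle returns an empty string, because the file position is already at the end; this may be why printing `g.read()` after the first read appeared blank.

```python
from pathlib import Path
from urllib.parse import urlparse


def read_url(path):
    """Return the first line of `path` as a cleaned http(s) URL.

    Hypothetical helper for illustration; raises ValueError if the file
    is empty or the line does not look like an http(s) URL.
    """
    text = Path(path).read_text(encoding="utf-8").strip()
    if not text:
        raise ValueError(f"{path} is empty")
    # Use only the first line and drop any stray whitespace or newline.
    url = text.splitlines()[0].strip()
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"not an http(s) URL: {url!r}")
    return url
```

Calling `read_url('extractlink.txt')` before making the request separates a file problem (a ValueError here) from a request problem (a 403 later).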

1 Answer


HTTP status code 403 (Forbidden) means the server understood the request but refuses to serve it. Either you need to log in, or your request is missing headers that the server expects. You can check the headers the browser sends in the Network tab of the browser's developer tools (Inspect Element). Try sending the same headers as the browser.
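
One way to try this is sketched below, under stated assumptions: the `Accept` and `Accept-Language` values are illustrative placeholders rather than values known to satisfy this particular server, so replace them with whatever the browser's Network tab shows. `summarize` and `fetch` are hypothetical helper names. `summarize` lists the status code and URL of every hop, which makes it easy to see whether a 302 was followed or the first response was already a 403.

```python
# Assumption: illustrative header values; copy the real ones from the browser.
BROWSER_HEADERS = {
    "User-Agent": (
        "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
        "(KHTML, like Gecko) Chrome/67.0.3396.99 Safari/537.36"
    ),
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en-US,en;q=0.9",
}


def summarize(response):
    """Return [(status_code, url), ...] for each redirect hop plus the final response."""
    hops = [(r.status_code, r.url) for r in response.history]
    hops.append((response.status_code, response.url))
    return hops


def fetch(url, headers=BROWSER_HEADERS, timeout=10):
    """Fetch `url` with browser-like headers, following redirects."""
    import requests  # imported here so summarize() works without requests installed

    with requests.Session() as session:
        session.headers.update(headers)
        return session.get(url, timeout=timeout, allow_redirects=True)


if __name__ == "__main__":
    # http://example.com/ is a placeholder; substitute the URL under test.
    for status, hop_url in summarize(fetch("http://example.com/")):
        print(status, hop_url)
```

If the output is a single line starting with 403, the server rejected the very first request and no redirect was ever offered, which matches the empty `r.history` described in the question.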