
I want to access the content of a web page, but I'm being redirected to another page even though I've set allow_redirects to False in my requests call. Here's an example code snippet:

import requests
from bs4 import BeautifulSoup

headers = {'User-Agent': user_agent} # assume I inserted my user agent here
URL = "https://stackoverflow.com/questions/73909641/program-is-about-space-utilisation-i-am-getting-error-72g-value-too-great-for"
html_content = requests.get(URL, allow_redirects=False, headers=headers)
soup = BeautifulSoup(html_content.content, "html.parser")

When I run this code, I don't get any content from the web page. However, if I set allow_redirects to True, I'm redirected to this question: Convert between byte count and "human-readable" string.

  • Yes, oh okay, I need to log in first then – edyvedy13 Mar 18 '23 at 13:02
    Why are you doing this particularly? Are you aware that [so] has [Data dumps](https://meta.stackexchange.com/questions/19579/where-are-the-stack-exchange-data-dumps)? Or that it has an [API](https://api.stackexchange.com/docs)? Why insist on scraping data that is readily available? – Abdul Aziz Barkat Mar 18 '23 at 13:07
  • Does this answer your question? [Http Redirection code 3XX in python requests](https://stackoverflow.com/questions/22150023/http-redirection-code-3xx-in-python-requests) Tl;dr you misunderstand how `allow_redirects` works, it can't prevent the server from sending redirects to you, all it does is that it stops `requests` from following the redirect. – Abdul Aziz Barkat Mar 18 '23 at 13:11
  • Actually, there are rate limits for API and it is incredibly slow to download from archieve.org – edyvedy13 Mar 18 '23 at 14:15
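As the comments point out, allow_redirects=False can't stop the server from answering with a 3xx; it only stops requests from following it, so the 302 itself (with an almost empty body) is what comes back. A minimal sketch of inspecting that raw redirect response; the helper name is mine, not part of the question's code:

```python
import requests

def describe_response(resp: requests.Response) -> str:
    """Summarise whether a response is a redirect and, if so, where it points."""
    if resp.is_redirect:
        return f"{resp.status_code} redirect -> {resp.headers.get('Location')}"
    return f"{resp.status_code} (no redirect)"

# With allow_redirects=False the 302 itself is returned instead of the
# duplicate target's page:
# resp = requests.get(URL, allow_redirects=False, headers=headers)
# print(describe_response(resp))
```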

1 Answer


You'd have to log in to get to the original SO question because

anonymous users get automatically redirected to the duplicate target when trying to access questions closed as duplicates with no answers

The quote is from the relevant Meta post, which is itself a duplicate of this question.

You can replicate this by switching to Private Mode in your browser and then opening this link:

Program is about space utilisation. i am getting error : 72G value too great for base (error token is "72")

You should get redirected to Convert between byte count and "human-readable" string.

EDIT:

You can turn off this behaviour and get to the original post with requests by appending this query parameter to the URL:

?noredirect=1

Here's an example:

import requests
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36",
}

URL = "https://stackoverflow.com/questions/73909641/program-is-about-space-utilisation-i-am-getting-error-72g-value-too-great-for?noredirect=1"
html_content = requests.get(URL, headers=headers)
title = BeautifulSoup(html_content.content, "html.parser").select_one("title")
print(title)

Output:

<title>linux - Program is about space utilisation. i am getting error : 72G value too great for base (error token is "72") - Stack Overflow</title>
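The comments also mention the Stack Exchange API, which avoids scraping (and the redirect issue) entirely. A minimal sketch using the public /questions endpoint; the parsing helper is mine, and the live call is commented out because the API is rate-limited:

```python
import requests

def question_title(payload: dict):
    """Pull the title out of a Stack Exchange API /questions response."""
    items = payload.get("items", [])
    return items[0]["title"] if items else None

# Uncomment to fetch live (no key needed, but subject to rate limits):
# resp = requests.get(
#     "https://api.stackexchange.com/2.3/questions/73909641",
#     params={"site": "stackoverflow"},
# )
# print(question_title(resp.json()))
```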