I have a web crawler, but I'm currently getting a 404 error when calling requests.get(url) from the requests module, even though the URL is reachable.
import requests
from bs4 import BeautifulSoup

base_url = "https://www.blogger.com/profile/"
# Append the last path segment of the next stored link to the base URL
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [404]>
However, if I hardcode the string site passed to the requests module as the exact same string, the response is 202.
site = "https://www.blogger.com/profile/01785989747304686024"
# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>
What just struck me is that it looks like there is a hidden newline after printing site
the first time; might that be what's causing the problem?
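For reference, a quick way to confirm that suspicion would be to print the repr of the string, which makes any trailing newline visible:

>>> print repr(site)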
The URLs to visit were stored in a file earlier with
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")
and fetched with
with open("foo") as p:
return p.readlines()
The question, then, is: what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.
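For illustration, one option would be to keep writing one URL per line and strip the trailing whitespace when reading the lines back in (a rough, untested sketch based on the snippet above):

with open("foo") as p:
    # strip() removes the trailing "\n" that readlines() leaves on each line
    return [line.strip() for line in p.readlines()]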