
I have a web crawler, but currently a 404 error occurs when calling requests.get(url) from the requests module, even though the URL is reachable.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.blogger.com/profile/"
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")

# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024

>>> print r
<Response [404]>

However, if I hardcode the string site as the exact same value, the response is 202.

site = "https://www.blogger.com/profile/01785989747304686024"

# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>

What just struck me is that there seems to be a hidden newline after printing site the first time; might that be what's causing the problem?
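
A quick way to check is to print repr(site), since repr shows escape characters that print hides; if there is a hidden newline, the output would look something like:

>>> print repr(site)
'https://www.blogger.com/profile/01785989747304686024\n'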

The URLs to visit were earlier stored in a file with:

for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")

and fetched with:

with open("foo") as p:
    return p.readlines()

The question, then, is: what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.
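
For illustration, a minimal sketch of one approach: keep writing one URL per line, but strip the newline when reading back:

# write one URL per line (as before)
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")

# read them back with the trailing newlines removed
with open("foo") as p:
    return [line.strip() for line in p]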

Isbister
  • you could "strip" the values before you save them: pop().rsplit(..)[-1].strip(), or use p.read().splitlines() as suggested in @Marc L. Allen's answer. – Markon Jun 27 '16 at 14:08
  • You can read lines with: `lines = [line.rstrip('\n') for line in open('filename')]` – Gal Dreiman Jun 27 '16 at 14:10

2 Answers


In reference to "Getting rid of \n when using .readlines()", perhaps use:

with open("foo") as p:
    return p.read().splitlines()
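
As a quick illustration of the difference, assuming foo holds one URL per line as written above:

>>> open("foo").readlines()
['https://www.blogger.com/profile/01785989747304686024\n']
>>> open("foo").read().splitlines()
['https://www.blogger.com/profile/01785989747304686024']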
Marc L. Allen

You can use:

r = requests.get(site.strip('\n'))

instead of:

r = requests.get(site)
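
A bare site.strip() (no argument) would also work here, since it removes any surrounding whitespace, not just the newline:

>>> "https://www.blogger.com/profile/01785989747304686024\n".strip()
'https://www.blogger.com/profile/01785989747304686024'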
ands