
I have a web crawler, but currently a 404 error occurs when calling requests.get(url) from the requests module, even though the URL is reachable.

import requests
from bs4 import BeautifulSoup

base_url = "https://www.blogger.com/profile/"
site = base_url + blogs_to_visit.pop().rsplit('/', 1)[-1]
r = requests.get(site)
soup = BeautifulSoup(r.content, "html.parser")

# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024

>>> print r
<Response [404]>

However, if I hardcode the string site as the exact same value, the response is 202.

site = "https://www.blogger.com/profile/01785989747304686024"

# Printing some values for debugging
>>> print site
https://www.blogger.com/profile/01785989747304686024
>>> print r
<Response [202]>

What just struck me is that there seems to be a hidden newline after printing site the first time; might that be what's causing the problem?
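
A quick way to check is to print repr(site), since repr shows escape characters that print hides; if there is a hidden newline, the output would look something like:

>>> print repr(site)
'https://www.blogger.com/profile/01785989747304686024\n'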

The URLs to visit were earlier stored in a file with:

for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")

and fetched with:

with open("foo") as p:
    return p.readlines()

The question, then, is: what would be a better way of writing them to the file? If I don't separate them with "\n", for example, all the URLs are glued together as one.
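
For illustration, a minimal sketch of one approach: keep writing one URL per line, but strip the newline when reading back:

# write one URL per line (as before)
for link in soup.select("h2 a[href]"):
    blogs.write(link.get("href") + "\n")

# read them back with the trailing newlines removed
with open("foo") as p:
    return [line.strip() for line in p]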

Isbister
  • you could "strip" the values before you save them: pop().rsplit(..)[-1].strip(), or use p.read().splitlines() as suggested in @Marc L. Allen's answer. – Markon Jun 27 '16 at 14:08
  • You can read lines with: `lines = [line.rstrip('\n') for line in open('filename')]` – Gal Dreiman Jun 27 '16 at 14:10

2 Answers


In reference to "Getting rid of \n when using .readlines()", perhaps use:

with open("foo") as p:
    return p.read().splitlines()
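
As a quick illustration of the difference, assuming foo holds one URL per line as written above:

>>> open("foo").readlines()
['https://www.blogger.com/profile/01785989747304686024\n']
>>> open("foo").read().splitlines()
['https://www.blogger.com/profile/01785989747304686024']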
Marc L. Allen

You can use:

r = requests.get(site.strip('\n'))

instead of:

r = requests.get(site)
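
A bare site.strip() (no argument) would also work here, since it removes any surrounding whitespace, not just the newline:

>>> "https://www.blogger.com/profile/01785989747304686024\n".strip()
'https://www.blogger.com/profile/01785989747304686024'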
ands