
I have a program that reads some URLs from a text file, fetches each page source with requests.get, and then uses beautifulsoup4 to extract some information.

import bs4
import requests

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    x = 0
    z = len(line)
    r = session.get(line[x:z])
    soup = bs4.BeautifulSoup(r.text, "html.parser")

This returns an HTTP 400 Bad Request - Invalid URL. However, when I do the same thing but type the URL out as a string literal, everything works (although it only fetches that one URL).

import bs4
import requests

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    r = session.get('http://www.ExactSameUrlAsEarlier.com')
    soup = bs4.BeautifulSoup(r.text, "html.parser")

How would I fix/modify this to allow me to cycle through the multiple URLs I have in the file? Just for clarification, this is what the inputfile.txt looks like:

http://www.url1.com/something1
http://www.url2.com/something2

etc.

Thanks in advance.

  • if there is only one url in the `'inputfile.txt'` does it still give you a 400? Also have you logged out `line[x:z]`, just to make sure it is a valid url getting pulled out? – Jay Hamilton Oct 15 '17 at 03:54
  • Yes, I have logged the output of `line[x:z]`, it returns a valid url. When I copy paste the url that `line[x:z]` contains directly into the `requests.get()` statement, it works. I have not tried with only one url in the input file, I'll try and see how that works – Ethan Graber Oct 15 '17 at 15:25

1 Answer


Each line you read from the file keeps its trailing newline character (`\n`), so the URL you pass to `session.get` is invalid. Strip the whitespace from each line before using it:

for line in f:
    url = line.strip()
    r = session.get(url)

There are other ways of stripping whitespace from the line; have a look at this post: Getting rid of \n when using .readlines()
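A quick way to see the problem is to print the line with `repr()`, which makes the hidden newline visible. This sketch uses the example URL from the question's inputfile.txt:

```python
# A line read via `for line in f` keeps its trailing newline,
# which makes the URL invalid when passed to requests.get.
line = "http://www.url1.com/something1\n"  # as yielded by file iteration
url = line.strip()                          # drops surrounding whitespace/newline

print(repr(line))  # 'http://www.url1.com/something1\n'
print(repr(url))   # 'http://www.url1.com/something1'
```

This also explains why slicing with `line[0:len(line)]` in the original code changed nothing: it returns the whole line, newline included.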

Stuart Buckingham