
I have a program that reads some URLs from a text file, fetches each page source with requests.get, and then uses beautifulsoup4 to extract some information.

import bs4
import requests

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    x = 0
    z = len(line)
    r = session.get(line[x:z])
    soup = bs4.BeautifulSoup(r.text, "html.parser")

This returns an HTTP 400 Bad Request - Invalid URL. However, when I do the same thing but type the URL out as a string literal, everything works (although it only fetches that one URL).

import bs4
import requests

f = open('inputfile.txt')
session = requests.Session()
session.headers.update({'User-Agent': 'Mozilla/5.0'})
for line in f:
    r = session.get('http://www.ExactSameUrlAsEarlier.com')
    soup = bs4.BeautifulSoup(r.text, "html.parser")

How would I fix/modify this to allow me to cycle through the multiple URLs I have in the file? Just for clarification, this is what the inputfile.txt looks like:

http://www.url1.com/something1
http://www.url2.com/something2

etc.

Thanks in advance.

  • if there is only one url in the `'inputfile.txt'` does it still give you a 400? Also have you logged out `line[x:z]`, just to make sure it is a valid url getting pulled out? – Jay Hamilton Oct 15 '17 at 03:54
  • Yes, I have logged the output of `line[x:z]`, it returns a valid url. When I copy paste the url that `line[x:z]` contains directly into the `requests.get()` statement, it works. I have not tried with only one url in the input file, I'll try and see how that works – Ethan Graber Oct 15 '17 at 15:25

1 Answer


Each line you read from the file keeps its trailing newline character (`\n`), so the URL you pass to `session.get` is invalid. Strip the whitespace from each line before using it:

for line in f:
    url = line.strip()
    r = session.get(url)

There are other ways of stripping whitespace from the line; have a look at this post: Getting rid of \n when using .readlines()
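A quick way to see the problem is to print the line with `repr()`, which makes the hidden newline visible. This sketch uses the example URL from the question's inputfile.txt:

```python
# A line read via `for line in f` keeps its trailing newline,
# which makes the URL invalid when passed to requests.get.
line = "http://www.url1.com/something1\n"  # as yielded by file iteration
url = line.strip()                          # drops surrounding whitespace/newline

print(repr(line))  # 'http://www.url1.com/something1\n'
print(repr(url))   # 'http://www.url1.com/something1'
```

This also explains why slicing with `line[0:len(line)]` in the original code changed nothing: it returns the whole line, newline included.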

Stuart Buckingham