
I'm writing a script to get information about buildings in NYC. I know that my code works and returns what I'd like it to: I was previously entering addresses manually and it worked. Now I'm trying to have it read addresses from a text file and query the website with that information, and I'm getting this error:

urllib.error.HTTPError: HTTP Error 400: Bad Request

I believe it has something to do with the website not liking lots of access from something that isn't a browser. I've heard something about User Agents but don't know how to use them. Here is my code:

from bs4 import BeautifulSoup
import urllib.request

f = open("FILE PATH GOES HERE")

def getBuilding(link):
    r = urllib.request.urlopen(link).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b",text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)


def main():
    for line in f:
        num, name = line.split(" ", 1)
        newName = name.replace(" ", "+")
        link = "LINK GOES HERE (constructed from num and newName variables)"
        getBuilding(link)      
    f.close()

if __name__ == "__main__":
    main()
Harrison
  • The fact that you've run the code in isolation makes me doubt the server is stopping the request based solely on your User Agent. More likely it is rate limiting your client, or there is a bug in how you've constructed your request... can you please put the real code in for your link and a sample line from your file? – Peter Brittain Jun 18 '16 at 22:34
  • I'll get back to you with that tomorrow morning! – Harrison Jun 19 '16 at 03:42

1 Answer


A 400 error means that the server cannot understand your request (e.g., malformed syntax). That said, it's up to the developers which status code they return and, unfortunately, not everyone strictly follows the codes' intended meanings.

Check out this page for more details on HTTP Status Codes.
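To see which status code the server is actually returning, one option is to catch the exception and print it together with the URL that was requested. This is only a minimal sketch for Python 3: fetch is a hypothetical helper, not part of the original script, and its opener parameter exists only so the function can be exercised without network access.

```python
from urllib.error import HTTPError
from urllib.request import urlopen


def fetch(url, opener=urlopen):
    """Return the response body, or None if the server answers with an error status."""
    try:
        return opener(url).read()
    except HTTPError as err:
        # %r prints the repr of the URL, which exposes hidden characters
        # such as a trailing "\n" that would make the request malformed.
        print("HTTP %d (%s) for %r" % (err.code, err.reason, url))
        return None
```

Printing the URL with %r rather than %s is a deliberate choice here: stray whitespace shows up as an escape sequence instead of silently wrapping the output onto a second line.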

As for how to set a User Agent: it is sent in a request header and, basically, identifies the client making the request. Here is a list of recognized User Agents. In Python 2 you will need urllib2 rather than urllib (it is also a built-in package); in Python 3 the same classes live in urllib.request. I will show you how to update the getBuilding function to set the header using that module. I would also recommend checking out the requests library: I find it super straightforward, and it is widely adopted and supported.

Python 2:

from bs4 import BeautifulSoup
from urllib2 import Request, urlopen

def getBuilding(link):
    # Build the request explicitly so a header can be attached before sending.
    q = Request(link)
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b",text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Python 3:

from bs4 import BeautifulSoup
from urllib.request import Request, urlopen

def getBuilding(link):
    # Build the request explicitly so a header can be attached before sending.
    q = Request(link)
    q.add_header('User-Agent', 'Mozilla/5.0')
    r = urlopen(q).read()
    soup = BeautifulSoup(r, "html.parser")
    print(soup.find("b",text="KEYWORDS IM SEARCHING FOR GO HERE:").find_next("td").text)

Note: The only difference between Python v2 and v3 is the import statement.
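If the same script has to run under both versions, one common pattern is to try the Python 3 import first and fall back to urllib2. The sketch below assumes nothing beyond the standard library; build_request is a hypothetical helper name used only to show the header being attached without sending anything.

```python
try:
    # Python 3: the classes live in urllib.request
    from urllib.request import Request, urlopen
except ImportError:
    # Python 2: the same names are provided by urllib2
    from urllib2 import Request, urlopen


def build_request(link, user_agent="Mozilla/5.0"):
    """Return a Request for link carrying a User-Agent header (nothing is sent yet)."""
    q = Request(link)
    q.add_header("User-Agent", user_agent)
    return q
```

The returned object can be passed straight to urlopen, exactly as q is in getBuilding.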

Akshay
Jordan Bonitatis
  • It's telling me there's no module named urllib2. That would be due to me using python 3 correct? – Harrison Jun 19 '16 at 16:40
  • Yup - I updated my answer to demonstrate both Python 2 and 3 import statements. Alternatively, you could do something like the solution offered by @cees-timmerman [here](http://stackoverflow.com/questions/7933417/how-do-i-set-headers-using-pythons-urllib/24870196#24870196) to have an import statement compatible w/ both versions – Jordan Bonitatis Jun 19 '16 at 19:18
  • ImportError: cannot import name 'Request'? – Harrison Jun 20 '16 at 02:03
  • I've tested by printing the link before it passes to the function and for some reason the link is being formed on 2 separate lines... http://puu.sh/pyXbk/cbc6e814b6.png – Harrison Jun 20 '16 at 02:11
  • I SOLVED IT! The bad request was due to the link being formed with incorrect syntax. I had to strip one of the strings that created the link because there was trailing white space which resulted in the link being 2 lines long. – Harrison Jun 20 '16 at 02:16
  • Nice - glad you figured it out! It's always worthwhile to pay attention to the response code. In this case, it was accurately relaying that there was a syntax error. Nice work :-) – Jordan Bonitatis Jun 20 '16 at 03:24
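The fix described in the comments can be sketched roughly as follows for Python 3. Each line read from a file still ends with its newline, so it is stripped before splitting. The base URL and the houseno and street parameter names are placeholders, because the real link format is not shown in the question. The sketch also uses urllib.parse.quote_plus in place of the original replace(" ", "+"), since it escapes other unsafe characters as well as spaces.

```python
from urllib.parse import quote_plus

# Hypothetical endpoint; the real site and its query parameters are not shown.
BASE_URL = "https://example.com/lookup"


def build_link(line):
    """Build a query URL from one 'NUMBER STREET NAME' line of the input file."""
    # strip() removes the trailing newline that broke the URL across two lines.
    num, name = line.strip().split(" ", 1)
    return "%s?houseno=%s&street=%s" % (BASE_URL, quote_plus(num), quote_plus(name))
```

Inside main, this would replace the manual construction of link, for example link = build_link(line).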