import webapp2
from google.appengine.api import urlfetch

class Crawl(webapp2.RequestHandler):
    def get(self):
        url = "http://www.example.com/path/to a/page"  # URL with a space
        result = urlfetch.fetch(url)
        self.response.write('status: %s' % result.status_code)  # Outputs 400
        self.response.write(result.content)  # Gives me the 400 error page

Thousands of URLs in the wild contain spaces, and there is no way we can correct them one by one.

Why does urlfetch get a 400 Bad Request error for this kind of URL, which is perfectly accessible through a browser? How can I work around this?

Tabrez Ahmed
  • I accept that there's no way but to [escape the request path in the url](http://stackoverflow.com/a/121017/1184247). Thank you all for helping me. – Tabrez Ahmed Jun 16 '13 at 23:50

1 Answer

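This happens because the URL needs to be properly encoded (as discussed below). Make sure any URLs with spaces are encoded with `%20` in place of each space. A browser does this encoding for you before it sends the request, which is why the same URL appears to work there.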
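As a rough sketch of how the encoding could be automated for many URLs, the snippet below percent-encodes only the path component and leaves the scheme, host, and query untouched. The helper name `escape_url_path` is hypothetical (it is not part of urlfetch or the GAE SDK), and the choice to treat `%` as safe is an assumption made so that URLs that are already escaped are not encoded twice. The import fallback is there because the question targets the Python 2 runtime.

```python
try:  # Python 3
    from urllib.parse import quote, urlsplit, urlunsplit
except ImportError:  # Python 2, the runtime used in the question
    from urllib import quote
    from urlparse import urlsplit, urlunsplit


def escape_url_path(url):
    """Percent-encode unsafe characters (such as spaces) in the path of url.

    Hypothetical helper for illustration; only the path is touched.
    """
    parts = urlsplit(url)
    # Keep "/" as the path separator and "%" so existing escapes
    # like %20 are not double-encoded (assumption: a literal "%" in
    # the input is always part of an escape sequence).
    safe_path = quote(parts.path, safe="/%")
    return urlunsplit(
        (parts.scheme, parts.netloc, safe_path, parts.query, parts.fragment)
    )


print(escape_url_path("http://www.example.com/path/to a/page"))
```

The escaped string can then be passed to `urlfetch.fetch(...)` in place of the raw URL. If your URLs may contain a literal `%` that is not part of an escape sequence, drop `%` from `safe` and make sure each URL is only escaped once.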

Nate
  • I just tried that url as well, in addition to one I just created on my own site, and I still seem to be getting a response code of 200. What GAE SDK version are you using? (Not sure if that would affect it at all, just wondering if we're testing on the same platform.) – Nate Jun 16 '13 at 22:45
  • Yeah, I tried it with the one you linked, as well as http://natecollings.com/blah%20123.html (which I just made). – Nate Jun 16 '13 at 22:46
  • I downloaded the SDK last week, so it must be the latest version. – Tabrez Ahmed Jun 16 '13 at 22:50
  • 1.8.1 was released on the 12th, but regardless, I don't think that would affect it. (Otherwise there would have previously been more questions with the same issue, I would think.) Have you tried it with the exact code I have above, to see what you get? – Nate Jun 16 '13 at 22:52
  • I checked it this way: `self.response.write(result.content)` and it gives me Google's 400 error page. – Tabrez Ahmed Jun 16 '13 at 22:54
  • Strange, works for me if I do `result.content`. Maybe post the exact code you're using in the post, so I can try it? – Nate Jun 16 '13 at 22:56
  • Added the exact handler class – Tabrez Ahmed Jun 16 '13 at 23:01
  • I'm not sure why it affects that url and not the others I tried with spaces directly (and not replaced by %20), but that url also gives me a 400 error **if** I just put the spaces in directly. It works if I replace the space with `%20`. – Nate Jun 16 '13 at 23:25
  • I replaced the space with the escape character for it and it suddenly works. BTW, Nate, are you certain that your first URL gave you a 200? I tried it and it gave me the exact same error I got with @Tabrez's URL. – Truerror Jun 16 '13 at 23:25
  • Yeah, I noticed the same thing, so I added it to my post. – Nate Jun 16 '13 at 23:26
  • Nah, tried it with python's standard urllib2. Same error. Only works if we use %20 instead of space. Same goes for Nate's first example URL (the one with the space). – Truerror Jun 16 '13 at 23:31
  • @Truerror - yeah, you're right. I just retested it, and I was mistaken, it only works if you have it properly encoded. The example.com one threw me off because it's a redirect, which is why there's no error. – Nate Jun 16 '13 at 23:34
  • @Truerror - 1.8.1. While the example.com one works for me, any others don't, without it encoded. Pretty sure that's because urlfetch follows redirects, according to the docs. So it was just giving the result of the final destination. – Nate Jun 16 '13 at 23:36
  • Thanks a lot to both of you for helping me with this. – Tabrez Ahmed Jun 16 '13 at 23:59