urlopen doesn't appear to work depending on how input text is generated

Question

Depending on where I copy a URL from, urllib.request.urlopen fails. For example, when I copy http://www.google.com/ from the browser's address bar and run the following script:

from urllib import request

url = "http://www.google.com/"

response = request.urlopen(url)
print(response)

I get the following error at the call to urlopen:

UnicodeEncodeError: 'ascii' codec can't encode character '\ufeff' in position 5: ordinal not in range(128)

But if I copy the URL string from text on a web page or type it out by hand, it works just fine. To be clear, this doesn't work:

from urllib import request

url = "http://www.google.com/"

response = request.urlopen(url)
print(response)

#url1 = "http://www.google.com/"
#
#response1 = request.urlopen(url1)
#print(response1)

but this does:

from urllib import request

#url = "http://www.google.com/"
#
#response = request.urlopen(url)
#print(response)

url1 = "http://www.google.com/"

response1 = request.urlopen(url1)
print(response1)

My suspicion is that the encoding is different in the actual address bar, and Spyder knows how to handle it, but I don't because I can't see what is actually going on.

EDIT: As requested...

print(ascii(url))
'http://www.google.com/\ufeff'

print(ascii(url1))
'http://www.google.com/'

Indeed the strings are different.

To see the difference in your strings, post the results of `print(ascii(url))` and `print(ascii(url1))`, because visually, there is no difference in your code. – Mark Tolonen, Mar 10 '18 at 17:45
Edited the question to show print(ascii(url)) and print(ascii(url1)). Upon research, this is the byte order mark. For some reason, when I copy from my browser's address bar it adds the BOM, but not when I copy from web page text. – jammertheprogrammer, Mar 10 '18 at 20:43

score 1 · Answer 1 · answered Mar 10 '18 at 17:40

\ufeff is a zero-width non-breaking space, so it's no wonder you can't see it. Yes, there's an invisible character in your URL. At least it's not gremlins.
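
One way to make such a character visible is to list every non-ASCII character in the string together with its Unicode name. The following is a minimal sketch; the helper name describe_non_ascii is made up for illustration, and the BOM is written as an escape so it shows up in the source.

```python
import unicodedata

def describe_non_ascii(text):
    """Return (index, escaped form, Unicode name) for each non-ASCII character."""
    return [
        (i, ascii(ch).strip("'"), unicodedata.name(ch, "<unnamed>"))
        for i, ch in enumerate(text)
        if ord(ch) > 127
    ]

# Hypothetical input: the pasted URL with the invisible character spelled out.
url = "http://www.google.com/\ufeff"

for index, escaped, name in describe_non_ascii(url):
    print(index, escaped, name)
# prints: 22 \ufeff ZERO WIDTH NO-BREAK SPACE
```

An empty result means the string is plain ASCII, which is what urlopen needs in the request line here.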

Martin Sand Christensen

quantik · Accepted Answer · 2018-03-10T20:52:38.257

0

You could try

from urllib import request

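url = "http://www.google.com/\ufeff"  # the pasted BOM, written as an escape so it is visible

# In Python 3 a str has no .decode(); "utf-8-sig" only applies when decoding bytes,
# so remove the character from the str directly.
response = request.urlopen(url.replace("\ufeff", ""))
print(response)

That Unicode character is the byte-order mark, or BOM. The utf-8-sig codec strips a leading BOM when decoding bytes; for a str that already contains the character, replacing it with an empty string removes the problem.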
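
To illustrate where utf-8-sig helps and where it does not, here is a small sketch that needs no network access. It assumes the URL arrives as bytes (for example, read from a file); the variable names are illustrative only. The codec removes a single leading BOM, so a trailing one, as in the question, still has to be replaced.

```python
import codecs

# Bytes with a BOM at the start, as a text editor might save them.
leading = codecs.BOM_UTF8 + b"http://www.google.com/"

print(ascii(leading.decode("utf-8-sig")))  # 'http://www.google.com/'
print(ascii(leading.decode("utf-8")))      # '\ufeffhttp://www.google.com/'

# Bytes with the BOM at the end, matching the string in the question.
trailing = b"http://www.google.com/" + codecs.BOM_UTF8

decoded = trailing.decode("utf-8-sig")     # the trailing BOM survives decoding
cleaned = decoded.replace("\ufeff", "")    # so it is removed explicitly

print(ascii(decoded))                      # 'http://www.google.com/\ufeff'
print(ascii(cleaned))                      # 'http://www.google.com/'
```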

edited Mar 10 '18 at 20:52

answered Mar 10 '18 at 17:43

quantik

Thank you for the response. I tried that and I get another error: "AttributeError: 'bytes' object has no attribute 'timeout'" I checked what actually gets passed into urlopen and it is the following: b'http://www.google.com/\xef\xbb\xbf' – jammertheprogrammer Mar 10 '18 at 20:35
@jamesthelames I just tested it out myself and those revisions seemed to fix it – quantik Mar 10 '18 at 20:53
Copied and pasted your code with no luck. I found that simply doing url.replace("\ufeff", "") worked though:
from urllib import request
url = "http://www.google.com/".replace("\ufeff", "")
response = request.urlopen(url)
print(response)
– jammertheprogrammer Mar 10 '18 at 23:48
@jamesthelames weird it worked for me. I guess it's difficult to replicate exactly because I can't copy and paste the url the same way you could. Glad you got it to work though. – quantik Mar 10 '18 at 23:51
Yeah, it is a weird problem indeed. Marked as answer for getting me pretty much 95% of the way there and explaining the BOM. Thanks again for your help. – jammertheprogrammer Mar 10 '18 at 23:53
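
The replace call that worked in the comments above can be generalized a little. The sketch below is one possible approach, not something from the thread: the helper names clean_url and open_clean are hypothetical, and dropping every character in Unicode category Cf (format characters, which include U+FEFF) is an assumption about what is safe to discard from a pasted URL.

```python
import unicodedata
from urllib import request

def clean_url(url):
    """Drop Unicode format characters (category Cf) and surrounding whitespace."""
    kept = (ch for ch in url if unicodedata.category(ch) != "Cf")
    return "".join(kept).strip()

def open_clean(url, timeout=10):
    """Open a URL after removing invisible format characters from it."""
    return request.urlopen(clean_url(url), timeout=timeout)

# The cleaning step itself needs no network access:
print(ascii(clean_url("http://www.google.com/\ufeff")))  # 'http://www.google.com/'
```

Calling open_clean("http://www.google.com/\ufeff") would then make the same request as the hand-typed URL.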
