I have the following function:
import urllib.request

def seek():
    web = urllib.request.urlopen("http://wecloudforyou.com/")
    text = web.read().decode("utf8")
    return text

texto = seek()
print(texto)
When I decode the response as UTF-8, I get the HTML code with its indentation and line breaks intact, just as it appears in the page source of the actual website.
<!DOCTYPE html>
<html>
<head>
<title>We Cloud for You |
If I remove .decode('utf8'), I get the code, but the indentation is gone and it's replaced by \n.
<!DOCTYPE html>\n<html>\n <head>\n <title>We Cloud for You
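The difference can be reproduced without any network access. The following is a minimal sketch, assuming Python 3; the byte string `raw` is a made-up stand-in for whatever `web.read()` returns, not the real page content.

```python
# Hypothetical stand-in for the bytes object returned by web.read()
raw = b"<!DOCTYPE html>\n<html>\n  <head>\n"

# print(raw) writes str(raw); for a bytes object that is its repr,
# in which each newline byte is shown as the two characters \ and n.
as_bytes = str(raw)

# Decoding yields a str that contains real newline characters.
as_text = raw.decode("utf8")

print(type(raw), type(as_text))
print(as_bytes)   # one line, with a b'...' wrapper and visible \n escapes
print(as_text)    # several lines, indentation preserved
```

In this sketch the undecoded value is a `bytes` object and the decoded one is a `str`, which appears to be the only difference between the two outputs.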
So, why is this happening? As far as I know, when you decode, you are basically converting some encoded string into Unicode.
My sys.stdout.encoding is CP1252 (Windows 1252 encoding)
According to this thread: Why does Python print unicode characters when the default encoding is ASCII?
- Python outputs non-unicode strings as raw data, without considering its default encoding. The terminal just happens to display them if its current encoding matches the data.
- Python outputs Unicode strings after encoding them using the scheme specified in sys.stdout.encoding.
- Python gets that setting from the shell's environment.
- The terminal displays output according to its own encoding settings.
- The terminal's encoding is independent from the shell's.
So it seems that Python needs the text as Unicode before it can convert it to CP1252 and print it on the terminal. But I don't understand why, if the text is not decoded, the indentation is replaced with \n.
sys.getdefaultencoding() returns utf8.
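The two settings mentioned above are separate values and can be inspected side by side. This is a small sketch, assuming Python 3; the sample string `text` is invented for illustration, and the values printed for the two settings depend on the machine.

```python
import sys

# Encoding that print() uses when writing str objects to standard output
print(sys.stdout.encoding)

# Python's default string encoding, independent of the terminal
print(sys.getdefaultencoding())

# Hypothetical sample text with one non-ASCII character (e with acute accent)
text = "caf\u00e9"

# The same str becomes different bytes under different encodings
encoded_cp1252 = text.encode("cp1252")
encoded_utf8 = text.encode("utf-8")

print(encoded_cp1252)
print(encoded_utf8)
```

Here the same text encodes to one byte for the accented character under CP1252 and to two bytes under UTF-8, which is why the encoding used at output time matters for non-ASCII characters.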