I am reading an HTML file saved on local disk, using requests and a LocalFileAdapter, as per @b1r3k's answer on "Python requests fetch a file from a local url". The relevant part is:
requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
ra = requests_session.get('file://X:\somefile.htm')
print ra.content
I am getting some junk on the console. If you look closely, it is actual letters separated by weird squares. What can I do to make it human-readable?
-
Looks like you have *compressed* data. Are you certain that `open('X:\\somefile.htm', 'rb').read()` doesn't produce the exact same data? – Martijn Pieters Sep 23 '15 at 12:04
-
@Martijn: No it is not compressed data. It is run of the mill HTML: [snip] – Pradeep Sep 23 '15 at 13:57
-
Ah, you have UTF-16-encoded data and the `\x00` bytes are displayed as square blocks. – Martijn Pieters Sep 23 '15 at 14:01
1 Answer
You have UTF-16 encoded data, with a BOM, printed out to a console that uses the Windows 1252 codepage:
>>> contents = u'''\
... <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
... <meta http-equiv="Content-Type" content="text/html; charset=unicode">
... <meta name="ProgId" content="Word.Document">[snip]
... </body></html>'''
>>> contents.encode('utf16').decode('cp1252')
u'\xff\xfe<\x00h\x00t\x00m\x00l\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00v\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00v\x00m\x00l\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00o\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00w\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00:\x00w\x00o\x00r\x00d\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00m\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00.\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00.\x00c\x00o\x00m\x00/\x00o\x00f\x00f\x00i\x00c\x00e\x00/\x002\x000\x000\x004\x00/\x001\x002\x00/\x00o\x00m\x00m\x00l\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x00T\x00R\x00/\x00R\x00E\x00C\x00-\x00h\x00t\x00m\x00l\x004\x000\x00"\x00>\x00<\x00h\x00e\x00a\x00d\x00>\x00\n\x00<\x00m\x00e\x00t\x00a\x00 \x00h\x00t\x00t\x00p\x00-\x00e\x00q\x00u\x00i\x00v\x00=\x00"\x00C\x00o\x00n\x00t\x00e\x00n\x00t\x00-\x00T\x00y\x00p\x00e\x00"\x00 \x00c\x00o\x00n\x00t\x00e\x00n\x00t\x00=\x00"\x00t\x00e\x00x\x00t\x00/\x00h\x00t\x00m\x00l\x00;\x00 \x00c\x00h\x00a\x00r\x00s\x00e\x00t\x00=\x00u\x00n\x00i\x00c\x00o\x00d\x00e\x00"\x00>\x00\n\x00<\x00m\x00e\x00t\x00a\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00P\x00r\x00o\x00g\x00I\x00d\x00"\x00 \x00c\x00o\x00n\x00t\x00e\x00n\x00t\x00=\x00"\x00W\x00o\x00r\x00d\x00.\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t\x00"\x00>\x00[\x00s\x00n\x00i\x00p\x00]\x00\n\x00<\x00/\x00b\x00o\x00d\x00y\x00>\x00<\x00/\x00h\x00t\x00m\x00l\x00>\x00'
>>> print contents.encode('utf16').decode('cp1252')[:2]
ÿþ
The UTF-16 byte order mark is printed as ÿþ, the Unicode U+00FF and U+00FE codepoints.
Decode your response data as UTF-16:
print ra.content.decode('utf16')
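If you are not sure in advance that a file is UTF-16, you can check for the byte order mark before decoding. The following is only a minimal sketch, written so that it runs on Python 3 (the session above is Python 2); the helper name `decode_utf16_with_bom` and the `fallback` parameter are made up for illustration, and it assumes the data is not UTF-32 (whose little-endian BOM starts with the same two bytes):

```python
import codecs


def decode_utf16_with_bom(data, fallback="utf-8"):
    """Decode bytes as UTF-16 when they start with a UTF-16 BOM.

    Hypothetical helper: anything without a UTF-16 BOM is decoded
    with the fallback codec instead.
    """
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        # The generic 'utf-16' codec uses the BOM to pick the byte
        # order and drops it from the decoded text.
        return data.decode("utf-16")
    return data.decode(fallback)


# Little-endian UTF-16 with a BOM, like the file in the question.
raw = codecs.BOM_UTF16_LE + u"<html></html>".encode("utf-16-le")

# Interpreting the same bytes as cp1252 gives the BOM as two
# characters followed by letters separated by NUL characters.
mojibake = raw.decode("cp1252")

text = decode_utf16_with_bom(raw)
print(repr(mojibake[:6]))
print(text)
```

With the response from the question, the equivalent call would be `decode_utf16_with_bom(ra.content)`.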
You probably want to open the files in binary mode, otherwise your newlines are going to be broken up. Use:
with open(file_path, 'rb') as file:
I've corrected the code in the answer you referenced.
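To see why binary mode matters, here is a small sketch (again written to run on Python 3, with a made-up temporary file and sample text) that writes UTF-16 data with Windows line endings, reads it back untouched in binary mode, and only then decodes it:

```python
import os
import tempfile

# In UTF-16-LE every character takes two bytes, so the CR and LF of a
# Windows line ending are no longer adjacent bytes.
crlf_bytes = u"\r\n".encode("utf-16-le")

# Hypothetical sample content, encoded as UTF-16 with a BOM.
payload = u"line one\r\nline two\r\n".encode("utf-16")

fd, file_path = tempfile.mkstemp(suffix=".htm")
with os.fdopen(fd, "wb") as handle:
    handle.write(payload)

# Binary mode hands back the bytes exactly as stored on disk.
with open(file_path, "rb") as handle:
    raw = handle.read()
os.remove(file_path)

# Decode first, then deal with line endings on the decoded text.
text = raw.decode("utf-16")
lines = text.splitlines()
print(repr(crlf_bytes))
print(lines)
```

Decoding before touching the line endings keeps the two-byte units intact; any newline handling then happens on text rather than on raw bytes.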

Martijn Pieters
-
Thanks much. I was about to ask how you guessed. Even with `decode('utf16')` I am getting squares as end-of-line markers. Like so: – Pradeep Sep 23 '15 at 14:14
-
@Pradeep: I looked closer and saw that between the square blocks were the characters. ASCII Characters interspersed with undecodable characters is a classic sign of badly decoded UTF-16. – Martijn Pieters Sep 23 '15 at 14:18
-
@Pradeep: added a solution to your newlines problem too; UTF-16 (little-endian) encodes a Windows newline as `\r\x00\n\x00` and that is probably going to get broken up when reading the file in text mode. – Martijn Pieters Sep 23 '15 at 14:20
-
Many thanks again. This is such a roller coaster. A bit off-track: using `with open(file_path, 'rb') as file:`, isn't there a clash with `ra=requests_session.get('file://file')`? Using just `ra=requests_session.get('file')` will not work because requests expects a schema and defaults to a URL, so I get: **MissingSchema: Invalid URL 'file': No schema supplied. Perhaps you meant http://file?** – Pradeep Sep 23 '15 at 14:30
-
@Pradeep: you copied code from another answer, the transporter needs to open the file to be able to serve you the file contents. The `open()` call is part of that transport. – Martijn Pieters Sep 23 '15 at 14:34