I am reading an HTML file saved on local disk, using requests and a LocalFileAdapter, as per @b1r3k's answer on "Python requests fetch a file from a local url". The relevant part is:

requests_session = requests.session()
requests_session.mount('file://', LocalFileAdapter())
ra = requests_session.get('file://X:\somefile.htm')
print ra.content

I am getting junk on the console. If you look closely, it is the actual letters separated by weird squares. What can I do to make it human-readable?

Pradeep

1 Answer

You have UTF-16 encoded data, with a BOM, printed out to a console that uses the Windows 1252 codepage:

>>> contents = u'''\
... <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:w="urn:schemas-microsoft-com:office:word" xmlns:m="http://schemas.microsoft.com/office/2004/12/omml" xmlns="http://www.w3.org/TR/REC-html40"><head>
... <meta http-equiv="Content-Type" content="text/html; charset=unicode">
... <meta name="ProgId" content="Word.Document">[snip]
... </body></html>'''
>>> contents.encode('utf16').decode('cp1252')
u'\xff\xfe<\x00h\x00t\x00m\x00l\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00v\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00v\x00m\x00l\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00o\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00w\x00=\x00"\x00u\x00r\x00n\x00:\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00-\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00-\x00c\x00o\x00m\x00:\x00o\x00f\x00f\x00i\x00c\x00e\x00:\x00w\x00o\x00r\x00d\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00:\x00m\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00s\x00c\x00h\x00e\x00m\x00a\x00s\x00.\x00m\x00i\x00c\x00r\x00o\x00s\x00o\x00f\x00t\x00.\x00c\x00o\x00m\x00/\x00o\x00f\x00f\x00i\x00c\x00e\x00/\x002\x000\x000\x004\x00/\x001\x002\x00/\x00o\x00m\x00m\x00l\x00"\x00 \x00x\x00m\x00l\x00n\x00s\x00=\x00"\x00h\x00t\x00t\x00p\x00:\x00/\x00/\x00w\x00w\x00w\x00.\x00w\x003\x00.\x00o\x00r\x00g\x00/\x00T\x00R\x00/\x00R\x00E\x00C\x00-\x00h\x00t\x00m\x00l\x004\x000\x00"\x00>\x00<\x00h\x00e\x00a\x00d\x00>\x00\n\x00<\x00m\x00e\x00t\x00a\x00 \x00h\x00t\x00t\x00p\x00-\x00e\x00q\x00u\x00i\x00v\x00=\x00"\x00C\x00o\x00n\x00t\x00e\x00n\x00t\x00-\x00T\x00y\x00p\x00e\x00"\x00 \x00c\x00o\x00n\x00t\x00e\x00n\x00t\x00=\x00"\x00t\x00e\x00x\x00t\x00/\x00h\x00t\x00m\x00l\x00;\x00 \x00c\x00h\x00a\x00r\x00s\x00e\x00t\x00=\x00u\x00n\x00i\x00c\x00o\x00d\x00e\x00"\x00>\x00\n\x00<\x00m\x00e\x00t\x00a\x00 \x00n\x00a\x00m\x00e\x00=\x00"\x00P\x00r\x00o\x00g\x00I\x00d\x00"\x00 \x00c\x00o\x00n\x00t\x00e\x00n\x00t\x00=\x00"\x00W\x00o\x00r\x00d\x00.\x00D\x00o\x00c\x00u\x00m\x00e\x00n\x00t\x00"\x00>\x00[\x00s\x00n\x00i\x00p\x00]\x00\n\x00<\x00/\x00b\x00o\x00d\x00y\x00>\x00<\x00/\x00h\x00t\x00m\x00l\x00>\x00'
>>> print contents.encode('utf16').decode('cp1252')[:2]
ÿþ

The UTF-16 byte order mark (the bytes 0xFF 0xFE for little-endian data) is printed as ÿþ, the Unicode code points U+00FF and U+00FE.
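
A minimal sketch of that diagnosis, written so it runs on Python 3 (the session above is Python 2). It checks the first two bytes against the BOM constants in the standard `codecs` module. The helper name `sniff_utf16_bom` and the sample bytes are hypothetical, chosen only for illustration:

```python
import codecs

def sniff_utf16_bom(data):
    """Return the UTF-16 codec name implied by a leading BOM, or None.

    Hypothetical helper for illustration; not part of requests.
    """
    if data.startswith(codecs.BOM_UTF16_LE):  # b'\xff\xfe'
        return 'utf-16-le'
    if data.startswith(codecs.BOM_UTF16_BE):  # b'\xfe\xff'
        return 'utf-16-be'
    return None

# Build little-endian UTF-16 bytes with an explicit BOM (assumed sample data).
sample = codecs.BOM_UTF16_LE + u'<html>'.encode('utf-16-le')

print(sniff_utf16_bom(sample))

# The same two BOM bytes, misread as Windows-1252, become U+00FF and U+00FE.
mojibake = sample[:2].decode('cp1252')
print(mojibake == u'\xff\xfe')
```

Note that a UTF-32 little-endian BOM also begins with the same two bytes, so a check like this is only a first hint, not proof of the encoding.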

Decode your response data as UTF-16:

print ra.content.decode('utf16')
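
For reference, a small self-contained sketch of what that decode does, using a hand-built byte string in place of `ra.content` (so `raw` here is an assumption standing in for the response body). It is written to run on Python 3. The generic `utf-16` codec reads the BOM to pick the byte order and drops it from the result:

```python
import codecs

# Stand-in for ra.content: BOM followed by little-endian UTF-16 text.
raw = codecs.BOM_UTF16_LE + u'<html>caf\xe9</html>'.encode('utf-16-le')

# The generic 'utf-16' codec consumes the BOM and uses it to choose byte order.
text = raw.decode('utf-16')
print(text == u'<html>caf\xe9</html>')

# Decoding with an explicit byte order keeps the BOM as U+FEFF, so strip it.
explicit = raw.decode('utf-16-le')
print(explicit.startswith(u'\ufeff'))
print(explicit.lstrip(u'\ufeff') == text)
```

With requests itself, setting `ra.encoding = 'utf-16'` before reading `ra.text` should give the same decoded string.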

You probably want to open the file in binary mode; otherwise your newlines are going to be broken up. Use:

with open(file_path, 'rb') as file:
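
A hedged sketch of the binary-mode read, assuming a small temporary file stands in for the Word-exported HTML (the file contents and variable names are made up for illustration; Python 3 syntax). In UTF-16-LE, CR and LF each occupy two bytes, so the file should reach the decoder byte for byte; reading with `'rb'` avoids any text-mode newline handling, and decoding afterwards restores the line endings:

```python
import codecs
import os
import tempfile

original = u'line one\r\nline two\r\n'
payload = codecs.BOM_UTF16_LE + original.encode('utf-16-le')

# Write the bytes to a throwaway file (hypothetical stand-in for the .htm file).
fd, file_path = tempfile.mkstemp(suffix='.htm')
try:
    with os.fdopen(fd, 'wb') as out:
        out.write(payload)

    # Binary mode: no newline translation, bytes come back exactly as stored.
    with open(file_path, 'rb') as file:
        data = file.read()
finally:
    os.remove(file_path)

# CR LF in UTF-16-LE is four bytes: 0D 00 0A 00.
print(u'\r\n'.encode('utf-16-le') == b'\r\x00\n\x00')

decoded = data.decode('utf-16')
print(decoded == original)
```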

I've corrected the code in the answer you referenced.

Martijn Pieters
  • Thanks much. I was about to ask how you guessed. Even with `decode('utf16')` I am getting squares as end-of-line markers. Like so: – Pradeep Sep 23 '15 at 14:14
  • @Pradeep: I looked closer and saw that the characters were sitting between the square blocks. ASCII characters interspersed with undecodable characters are a classic sign of badly decoded UTF-16. – Martijn Pieters Sep 23 '15 at 14:18
  • @Pradeep: added a solution to your newlines problem too; UTF-16 (little-endian) encodes a CRLF newline as `\r\x00\n\x00` and that is probably going to break when the file is read in text mode. – Martijn Pieters Sep 23 '15 at 14:20
  • Many thanks again. This is such a roller coaster. A bit off-track: when using `with open(file_path, 'rb') as file:`, isn't there a clash with `ra=requests_session.get('file://file')`? Using just `ra=requests_session.get('file')` will not work because requests expects a URL scheme, so I get: MissingSchema: Invalid URL 'file': No schema supplied. Perhaps you meant http://file? – Pradeep Sep 23 '15 at 14:30
  • @Pradeep: you copied code from another answer; the transport adapter needs to open the file to be able to serve you the file contents. The `open()` call is part of that transport. – Martijn Pieters Sep 23 '15 at 14:34