
I have a .txt file generated by a built-in Windows tool that I need to parse in a Python script (running on a Linux machine, in case that is relevant).

I open the file as such:

with open(path, 'r') as spec_file:

I even tried the io library:

with io.open(detail, mode="r", encoding="utf-8") as spec_file:

When the file is opened in an editor such as Sublime Text, it is displayed properly. When I iterate through the file line by line with:

for line in spec_file:

and print each line with print(line), I get the correct representation as well:

**********************************************************************************
* This diagnostic information may be used by an IT administrator to troubleshoot *
* the installed Trusted Platform Module (TPM). Please zip the folder and attach  *
* it to issues filed through Feedback Hub or with an IT admin.                   *
**********************************************************************************

However, when I print with print(repr(line)), I get a string with a \x00 byte after every character:

'*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00T\x00h\x00i\x00s\x00 \x00d\x00i\x00a\x00g\x00n\x00o\x00s\x00t\x00i\x00c\x00 \x00i\x00n\x00f\x00o\x00r\x00m\x00a\x00t\x00i\x00o\x00n\x00 \x00m\x00a\x00y\x00 \x00b\x00e\x00 \x00u\x00s\x00e\x00d\x00 \x00b\x00y\x00 \x00a\x00n\x00 \x00I\x00T\x00 \x00a\x00d\x00m\x00i\x00n\x00i\x00s\x00t\x00r\x00a\x00t\x00o\x00r\x00 \x00t\x00o\x00 \x00t\x00r\x00o\x00u\x00b\x00l\x00e\x00s\x00h\x00o\x00o\x00t\x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00t\x00h\x00e\x00 \x00i\x00n\x00s\x00t\x00a\x00l\x00l\x00e\x00d\x00 \x00T\x00r\x00u\x00s\x00t\x00e\x00d\x00 \x00P\x00l\x00a\x00t\x00f\x00o\x00r\x00m\x00 \x00M\x00o\x00d\x00u\x00l\x00e\x00 \x00(\x00T\x00P\x00M\x00)\x00.\x00 \x00P\x00l\x00e\x00a\x00s\x00e\x00 \x00z\x00i\x00p\x00 \x00t\x00h\x00e\x00 \x00f\x00o\x00l\x00d\x00e\x00r\x00 \x00a\x00n\x00d\x00 \x00a\x00t\x00t\x00a\x00c\x00h\x00 \x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00 \x00i\x00t\x00 \x00t\x00o\x00 \x00i\x00s\x00s\x00u\x00e\x00s\x00 \x00f\x00i\x00l\x00e\x00d\x00 \x00t\x00h\x00r\x00o\x00u\x00g\x00h\x00 \x00F\x00e\x00e\x00d\x00b\x00a\x00c\x00k\x00 \x00H\x00u\x00b\x00 \x00o\x00r\x00 \x00w\x00i\x00t\x00h\x00 \x00a\x00n\x00 \x00I\x00T\x00 \x00a\x00d\x00m\x00i\x00n\x00.\x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00 \x00*\x00\n'
'\x00\n'
'\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00*\x00\n'

This makes it impossible to search through the file and work with it as a string, so I need to somehow convert it to a normal UTF-8 string. Any ideas how that could be done?

T. Maxx

1 Answer


Your file is encoded in UTF-16 LE (because Windows; see this question for more info), so you need to set that as the encoding:

with open(path, 'r', encoding="utf-16le") as spec_file:

LE stands for Little Endian. This matters because the plain "utf-16" codec checks for a Byte Order Mark to determine the byte order, and your file doesn't have one (the repr output starts directly with '*\x00'; again, because Windows), so you need to state the endianness explicitly.
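As a rough, self-contained sketch of the fix: the snippet below builds a small stand-in file (UTF-16 LE bytes, no BOM, Windows line endings), reads it back with encoding="utf-16le", and checks that the result behaves like an ordinary string. The helper name read_utf16le_text and the sample text are made up for illustration; the real file would come from the Windows tool.

```python
import os
import tempfile


def read_utf16le_text(path):
    """Hypothetical helper: read a BOM-less UTF-16 LE file into a normal str."""
    # Text mode with the default newline handling turns "\r\n" into "\n"
    # after decoding, so no stray "\x00\n" lines appear.
    with open(path, "r", encoding="utf-16le") as spec_file:
        return spec_file.read()


# Stand-in for the real file: UTF-16 LE bytes, no BOM, CRLF line endings.
sample = "* the installed Trusted Platform Module (TPM). *\r\n* second line *\r\n"
with tempfile.NamedTemporaryFile(delete=False, suffix=".txt") as tmp:
    tmp.write(sample.encode("utf-16le"))
    sample_path = tmp.name

try:
    text = read_utf16le_text(sample_path)
    lines = text.splitlines()
    print(repr(lines[0]))   # no \x00 between the characters any more
    print("TPM" in text)    # plain substring search now works
finally:
    os.remove(sample_path)
```

Once decoded, the text is a regular Python str, so there is no separate "convert to UTF-8" step unless you write it back out. If a file did start with a BOM, "utf-16le" would leave it in the text as '\ufeff' at the start of the first line; for such files the plain "utf-16" codec would be the better choice.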

SuperStormer