Unable to read file as plain text

Question

I have a PowerShell script that I'm trying to read in and analyze. I'm able to read it only as bytes, not as plain text.

f=open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1','r')
txt=f.read()

When I run the above code, I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

So I tried reading it as bytes and then decoding it to plain text, but I still get the same error.

f=open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1','rb')
txt=f.read()
txt.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I looked at the data on the command line and noticed that all the files start with "��". I suspect this character is causing the problem, but I don't know how to solve it.

Could you please help?

*No*. I said UTF-8 is *not what the file is encoded in*. Try another encoding. It looks like that is a BOM, try `'utf-8-sig'` — juanpa.arrivillaga, Apr 12 '19 at 07:43
Possible duplicate of [error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte](https://stackoverflow.com/questions/42339876/error-unicodedecodeerror-utf-8-codec-cant-decode-byte-0xff-in-position-0-in) — YesThatIsMyName, Apr 12 '19 at 07:45
Hi, try using `utf-16` for decoding as the above duplicate suggests in an answer. — YesThatIsMyName, Apr 12 '19 at 07:48
@juanpa.arrivillaga, sorry for misreading your initial comment. I changed the encoding to `'utf-8-sig'` but it still doesn't work — Sridhar Murali, Apr 12 '19 at 07:48
@YesThatIsMyName, thanks for the answer! Changing the encoding to `utf-16` works! Thank you! — Sridhar Murali, Apr 12 '19 at 07:49

Thomas · Answer 1 · 2019-04-12T08:16:49.127

4

Edit: despite four upvotes, my guess was wrong. In UTF-8 encoding, the BOM would look like 0xEF,0xBB,0xBF, so the first byte is 0xEF and not 0xFF.

0xFF,0xFE would signify the start of a little-endian UTF-16 file. Use the utf-16 encoding for that!
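As a minimal sketch of that fix, the snippet below opens a file with the `utf-16` codec, which reads the BOM to choose the byte order and then discards it. The helper name `read_utf16_text` and the temporary `demo.ps1` file are made up for illustration. The call to `os.path.expanduser` is an added assumption about the path: `open()` does not expand `~` on its own.

```python
import os
import tempfile


def read_utf16_text(path):
    """Read a text file that starts with a UTF-16 BOM (FF FE or FE FF)."""
    # The 'utf-16' codec uses the BOM to pick the byte order, then drops it.
    # expanduser() turns a leading '~' into the home directory; open() does not.
    with open(os.path.expanduser(path), 'r', encoding='utf-16') as f:
        return f.read()


# Self-contained demo: write a small little-endian UTF-16 file with a BOM,
# then read it back as text.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, 'demo.ps1')  # hypothetical file name
    with open(demo_path, 'wb') as f:
        f.write(b'\xff\xfe' + 'Get-Service'.encode('utf-16-le'))
    text = read_utf16_text(demo_path)

print(text)
```

The same helper should also handle a big-endian file (BOM 0xFE,0xFF), since the codec decides the byte order from the BOM.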

My original guess was that the two "unknown" characters at the start are a Unicode BOM (byte-order mark).

If that were the case, you would decode with utf-8-sig instead of utf-8. Either way, there's no need to read as bytes first; you can pass an encoding to the open() function directly:

f = open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1', 'r', encoding='utf-8-sig')

edited Apr 12 '19 at 08:16

answered Apr 12 '19 at 07:43

Thomas


@SridharMurali That's because I made a mistake, see my edit! – Thomas Apr 12 '19 at 08:17

score 2 · Accepted Answer · answered Apr 12 '19 at 07:58

I'm quoting Peter Ogden's answer to "error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte" (not the accepted answer there):

I came across this thread when suffering from the same error. After doing some research, I can confirm that this error happens when you try to decode a UTF-16 file as UTF-8.

With UTF-16, the first character (2 bytes in UTF-16) is a Byte Order Mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second byte will be the other.

Heavily edited after I found out the real answer

So, changing to UTF-16 should fix your problem.
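When the encoding is not known in advance, one possible approach is to inspect the first bytes for a BOM before decoding. The sketch below uses the BOM constants from the standard `codecs` module. The function name `guess_encoding` and the fallback to UTF-8 are assumptions made for illustration, and a file without a BOM cannot be identified this way.

```python
import codecs

# BOM signatures mapped to codecs that consume the BOM while decoding.
# UTF-32 is checked before UTF-16 because the UTF-32-LE BOM (FF FE 00 00)
# begins with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, 'utf-32'),
    (codecs.BOM_UTF32_BE, 'utf-32'),
    (codecs.BOM_UTF8, 'utf-8-sig'),
    (codecs.BOM_UTF16_LE, 'utf-16'),
    (codecs.BOM_UTF16_BE, 'utf-16'),
]


def guess_encoding(raw, default='utf-8'):
    """Return a codec name based on a leading BOM, or `default` if none is found."""
    for bom, name in _BOMS:
        if raw.startswith(bom):
            return name
    return default


# Bytes as they would appear at the start of a little-endian UTF-16 file.
sample = b'\xff\xfe' + 'Get-Service'.encode('utf-16-le')
encoding = guess_encoding(sample)
decoded = sample.decode(encoding)
print(encoding, decoded)
```

In practice the bytes would come from `open(path, 'rb').read()`, and the result could be passed as the `encoding` argument when reopening the file in text mode.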
