I have a PowerShell script that I'm trying to read in and analyze. I can only read it as bytes, not as plain text.

f=open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1','r')
txt=f.read()

When I try the above code, I get an error:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

So I tried reading it as bytes and then decoding it to plain text, but I still get the same error.

f=open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1','rb')
txt=f.read()
txt.decode('utf-8')

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte

I looked at the data on the command line and noticed that all the files start with "��". I suspect this character is causing the problem, but I don't know how to solve it.

Could you please help?

Sridhar Murali

2 Answers

Edit: despite four upvotes, my guess was wrong. In UTF-8 encoding, the BOM would look like 0xEF,0xBB,0xBF, so the first byte is 0xEF and not 0xFF.

The bytes 0xFF, 0xFE mark the start of a little-endian UTF-16 file. Use the utf-16 encoding for that: it reads the BOM, picks the byte order from it, and drops it from the decoded text.
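
As a rough sketch of what that looks like in practice: the helper name read_utf16_text and the demo file below are made up for illustration, and the demo file is written by hand with an FF FE BOM on the assumption that this is the layout of the script in the question.

```python
import os
import tempfile


def read_utf16_text(path):
    """Read a text file that was saved as UTF-16 with a BOM.

    Hypothetical helper for illustration. The 'utf-16' codec consumes
    the BOM and uses it to choose little- or big-endian decoding.
    """
    with open(path, "r", encoding="utf-16") as f:
        return f.read()


# Demo: build a small file that starts with the bytes FF FE,
# then read it back as text.
sample = "Get-Service | Where-Object Status -eq 'Running'\n"

with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.ps1")  # placeholder file name
    with open(demo_path, "wb") as f:
        f.write(b"\xff\xfe" + sample.encode("utf-16-le"))

    text = read_utf16_text(demo_path)

print(text)
```

If the file has already been read in 'rb' mode, calling .decode('utf-16') on the bytes should give the same result.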


My guess is that the two "unknown" characters at the start are a Unicode BOM (byte-order mark).

If that's the case, decode with utf-8-sig instead of utf-8. There's no need to read as bytes first; you can pass an encoding to the open() function directly:

f = open('~/Data/3 - Get-Services - Jobs Version 1.0.ps1', 'r', encoding='utf-8-sig')
Thomas

I am quoting Peter Ogden's answer (not the accepted one) to the question "error UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 0: invalid start byte":

I came across this thread while suffering from the same error. After doing some research, I can confirm that this error happens when you try to decode a UTF-16 file as UTF-8.

With UTF-16, the first character (2 bytes in UTF-16) is a byte order mark (BOM), which is used as a decoding hint and doesn't appear as a character in the decoded string. This means the first byte will be either FE or FF, and the second byte will be the other.

Heavily edited after I found out the real answer

So, changing to UTF-16 should fix your problem.
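
If you are not sure which kind of BOM a file has, one option is to look at the first bytes before decoding. This is only a sketch: guess_encoding_from_bom is a made-up helper name, the fallback to utf-8 is an assumption, and UTF-32 is deliberately not handled.

```python
import codecs


def guess_encoding_from_bom(raw, default="utf-8"):
    """Pick a codec name from the BOM at the start of raw bytes.

    Hypothetical helper for illustration. UTF-32 is not handled: its
    little-endian BOM (FF FE 00 00) also starts with FF FE, so such a
    file would be misreported as UTF-16 here.
    """
    if raw.startswith(codecs.BOM_UTF8):  # EF BB BF
        return "utf-8-sig"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):  # FF FE / FE FF
        return "utf-16"
    return default  # assumption: no BOM means plain UTF-8


# Demo on in-memory bytes that mimic the start of each kind of file.
utf16_le_bytes = codecs.BOM_UTF16_LE + "Get-Service".encode("utf-16-le")
utf16_be_bytes = codecs.BOM_UTF16_BE + "Get-Service".encode("utf-16-be")
utf8_sig_bytes = codecs.BOM_UTF8 + "Get-Service".encode("utf-8")

for raw in (utf16_le_bytes, utf16_be_bytes, utf8_sig_bytes, b"Get-Service"):
    encoding = guess_encoding_from_bom(raw)
    print(encoding, repr(raw.decode(encoding)))
```

For a real file, you would read the bytes with open(path, 'rb') and pass them to the helper before decoding.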

YesThatIsMyName