Invalid start byte error using replace() function in python

Question

I am running a simple script to replace one word with another in my files, like so:

import random
import os

path = '/path/of/file/'
files = os.listdir (path)

for file in files:
    with open (path + file) as f:
        newText = f.read().replace('Plastic Ba','PlasticBag')

    with open (path + file, "w") as f:
        f.write(newText)

In doing so, I get an error that I have never encountered before:

Traceback (most recent call last):
  File "replaceText.py", line 9, in <module>
    newText = f.read().replace('Plastic Ba', 'PlasticBag')
  File "/Users/vivek/anaconda3/lib/python3.6/codecs.py", line 321, in decode
    (result, consumed) = self._buffer_decode(data, self.errors, final)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0x80 in position 3131: invalid start byte

I am not sure what this means or what the mistake is. I have run this script multiple times in the past without any issues. Any help on resolving this would be great!

What is the encoding of the text file? Can you provide a sample of what the file looks like around the 3131st byte? — Daniel Pryden, Aug 21 '18 at 23:28
The `replace` is completely irrelevant here; the exception is coming from the `read()`, before you even get there. And what the exception means is that the file is not UTF-8 (e.g., it's Latin-1 or cp1252), but you've tried to open it as UTF-8. (Or, possibly, that it's UTF-8 but corrupted, but that's less likely.) — abarnert, Aug 21 '18 at 23:29
You could potentially resolve the problem by opening the file in binary mode and doing replacements only using byte strings. But probably the better solution is to open the file with the correct encoding (and yes, CP 1252 is probably a decent guess if it isn't UTF-8 but it is a superset of ASCII). — Daniel Pryden, Aug 21 '18 at 23:32
Specifically: `UnicodeDecodeError` means you're trying to `read`/`decode`/etc. text with the wrong encoding. `'utf-8'` is the encoding you're trying to use (it's the default for most things nowadays). `byte 0x80 in position 3131` is helpfully telling you where the problem happens, so you can, e.g., `with open(path+file, 'rb') as f: print(f.read()[3100:3200])` to debug the problem. (Or to post it on Stack Overflow so someone else can debug it.) — abarnert, Aug 21 '18 at 23:32
@DanielPryden Also, unlike Latin-1, cp1252 is a superset of ASCII where `\x80` is `'€'`, instead of a nonprinting control character, so… I probably should have suggested that one first. — abarnert, Aug 21 '18 at 23:34
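
The debugging steps suggested in the comments above can be sketched roughly as follows. This is only an illustration: the sample bytes stand in for the asker's file (which is not shown), the helper name `show_context` is made up for the example, and cp1252 is the guess from the comments, not a confirmed encoding.

```python
# Hypothetical sample data: cp1252-encoded text in which the euro sign
# is stored as the single byte 0x80, the byte named in the traceback.
raw = "Plastic Ba costs 5 €".encode("cp1252")


def show_context(data: bytes, position: int, width: int = 10) -> bytes:
    """Return the raw bytes around `position` (illustrative helper name)."""
    start = max(0, position - width)
    return data[start:position + width]


try:
    raw.decode("utf-8")
except UnicodeDecodeError as exc:
    # exc.start is the offset of the offending byte, i.e. the
    # "position 3131" part of the message in the question.
    bad_position = exc.start
    context = show_context(raw, bad_position)
    print(bad_position, context)

# Decoding with the guessed encoding succeeds for this sample.
text = raw.decode("cp1252")

# Alternative from the comments: skip decoding entirely and replace
# byte strings on data that was read in binary mode.
patched = raw.replace(b"Plastic Ba", b"PlasticBag")
```

Looking at the bytes around the reported position usually makes it easier to guess the real encoding before changing how the file is opened.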

Raviteja Ainampudi · Answer 1 · 2018-08-22T00:07:13.717

0

Did you try encoding the file as 'UTF-8'? Please check the parameters of the `open` function:

open(file, mode='r', buffering=-1, encoding=None, errors=None, newline=None, closefd=True, opener=None)

In your script, try using:

with open(path + file, 'r', encoding='windows-1252') as f:
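
A fuller sketch of that suggestion, applied to the loop from the question, might look like the following. It assumes the files really are cp1252 (the same encoding as `windows-1252`); the function name `replace_in_files` and the temporary demo file are made up for the example.

```python
import os
import tempfile


def replace_in_files(directory, old, new, encoding="cp1252"):
    """Replace `old` with `new` in every regular file in `directory`.

    `encoding` is an assumption about the files; adjust it if they
    turn out to use something else.
    """
    for name in os.listdir(directory):
        full_path = os.path.join(directory, name)
        if not os.path.isfile(full_path):
            continue  # skip subdirectories
        # Read with an explicit encoding instead of the platform default.
        with open(full_path, "r", encoding=encoding) as f:
            text = f.read()
        # Write back with the same encoding so the file stays consistent.
        with open(full_path, "w", encoding=encoding) as f:
            f.write(text.replace(old, new))


# Demo on a throwaway directory containing one cp1252-encoded file.
with tempfile.TemporaryDirectory() as tmp:
    sample = os.path.join(tmp, "sample.txt")
    with open(sample, "wb") as f:
        f.write("Plastic Ba: 5 €\n".encode("cp1252"))
    replace_in_files(tmp, "Plastic Ba", "PlasticBag")
    with open(sample, "rb") as f:
        result = f.read()
```

Passing the same `encoding` for both reading and writing keeps the non-ASCII bytes unchanged on disk. If the guess is wrong, decoding may still fail or silently produce the wrong characters, so it is worth checking the output on a copy of the files first.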

You can also check out the `open` function available in the `codecs` library. Please see this question: Unicode (UTF-8) reading and writing to files in Python

UnicodeDecodeError: 'utf8' codec can't decode byte 0x80 in position 3131: invalid start byte

edited Aug 22 '18 at 00:07

answered Aug 22 '18 at 00:01


The error message in the question indicates that the `utf-8` codec is already being used (and failing to decode), so that can't be the answer. – Daniel Pryden Aug 22 '18 at 00:04
What, you found a link to another question which fails to read a file and it fails at the same position `3131`, with the same error? That is weird. – zvone Aug 22 '18 at 00:19
