I am trying to write a script that cleans unnecessary characters out of a data .txt file. The script ran successfully once, but every other attempt fails with the error `UnicodeDecodeError: 'utf-8' codec can't decode byte 0xa2 in position 8149: invalid start byte`

import codecs
import sys

if len(sys.argv) < 2:
        startFile = "test.txt"
else:
        startFile = sys.argv[1]

finishFile = "newtest.txt"



def cleanFile():
        f = open(startFile, "r")
        #f = codecs.open("GNMFDB.TXT", "r", "utf-8")
        newFile = open(finishFile, "a")

        for line in f:
                line = line.replace("=", "")

                newFile.write(line)


def clearNewFile():
        newFile = open(finishFile, "w")
        newFile.close()


if __name__ == "__main__":
        #startFile = "test.txt"
        #finishFile = "newtest.txt"
        clearNewFile()
        cleanFile()

I know the issue has to do with UTF-8 bytes being decoded into strings, or something along those lines. If I copy some lines from the original .txt file into a separate .txt file that I created in vim, the script runs successfully every time. I know codecs could be used in a situation like this, but when I tried it, it gave me a similar error (hence the commented-out line).

wvano97
  • Yes, it's the encoding; see [this](https://stackoverflow.com/questions/21504319/python-3-csv-file-giving-unicodedecodeerror-utf-8-codec-cant-decode-byte-err) – Trilok Venkat Jun 17 '20 at 16:06
  • It seems you never close the files opened in the function cleanFile; try closing them. – Jason Yang Jun 17 '20 at 16:07
  • Does this answer your question? [Python 3 CSV file giving UnicodeDecodeError: 'utf-8' codec can't decode byte error when I print](https://stackoverflow.com/questions/21504319/python-3-csv-file-giving-unicodedecodeerror-utf-8-codec-cant-decode-byte-err) – lenz Jun 17 '20 at 19:50
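
On the point about unclosed files, here is a minimal sketch of the cleaning function using `with` blocks, so that both files are closed even if an exception is raised partway through. The name `clean_file` and its parameters `start_path` and `finish_path` are illustrative, not from the original script. Opening the output with "w" truncates it, which makes the separate `clearNewFile` step unnecessary. This change alone does not address the decoding error.

```python
def clean_file(start_path, finish_path):
    """Copy start_path to finish_path with every "=" removed."""
    # "w" truncates the output file, so no separate clearing step is needed.
    with open(start_path, "r") as src, open(finish_path, "w") as dst:
        for line in src:
            dst.write(line.replace("=", ""))
    # Both files are closed at this point, even if the loop raised an exception.
```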

1 Answer

Did you try to first encode it and then decode it when writing it to newFile? When you read the file, you first have to encode each line, then do your work on that line, and then decode it again with UTF-8:

for line in f:
        line.encode('utf-8')
        # your code goes here
        line.decode('utf-8')

Another solution you may try is to put a try/except block inside the for loop to check whether the error happens on all lines or only a few. If it happens on only a few lines, you may drop them. Hope this helps.
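
A sketch of how the "check which lines fail" idea could be carried out. One caveat: in text mode the decoding happens while the file is being read, so the `UnicodeDecodeError` is raised by the `for line in f` statement itself, and a try/except inside the loop body would not catch it. The version below therefore reads raw bytes and attempts the decode explicitly. The helper name `find_undecodable_lines` is hypothetical.

```python
def find_undecodable_lines(path, encoding="utf-8"):
    """Return (line_number, raw_bytes) for each line that fails to decode."""
    bad = []
    # Binary mode: iterating yields bytes, so nothing is decoded implicitly.
    with open(path, "rb") as f:
        for number, raw in enumerate(f, start=1):
            try:
                raw.decode(encoding)
            except UnicodeDecodeError:
                bad.append((number, raw))
    return bad
```

Printing the returned list shows which lines contain the offending bytes, which can help in guessing the file's real encoding before deciding whether to drop anything.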

  • There are multiple problems with your code. For one, if you *encode* a line before processing, you'll have to deal with byte strings, which is almost never what you want if the input is text – you want to process text. Also, the statement `line.encode('utf8')` without capturing the return value is pointless – it has no effect on the variable `line` (or any other variable). – lenz Jun 17 '20 at 19:45
  • The truth is: text needs to be *decoded* when read and *encoded* when written (not the other way round, as you write). But if you use `open(..., 'r')` and `open(..., 'w')` to open the files, then Python does the decoding and encoding steps for you, and you mustn't call `.encode()` and `.decode()` yourself. – lenz Jun 17 '20 at 19:48
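
Following these comments, here is a minimal sketch that lets `open()` do the decoding and encoding by passing an explicit `encoding`. The choice of `cp1252` for the input is an assumption: byte 0xa2 is the cent sign in both Latin-1 and Windows-1252, but the file's real encoding has to be confirmed from wherever the data came from. The function and parameter names are illustrative.

```python
def clean_file(start_path, finish_path,
               in_encoding="cp1252",   # assumption: the input is not really UTF-8
               out_encoding="utf-8"):
    """Copy start_path to finish_path with every "=" removed."""
    # open() decodes on read and encodes on write, so there are no manual
    # .encode()/.decode() calls anywhere in the loop.
    with open(start_path, "r", encoding=in_encoding) as src, \
         open(finish_path, "w", encoding=out_encoding) as dst:
        for line in src:
            dst.write(line.replace("=", ""))
```

If the encoding cannot be determined, passing `errors="replace"` to the reading `open()` call substitutes U+FFFD for undecodable bytes instead of raising, at the cost of losing those characters.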