Invalid continuation byte while reading .txt file

Question

I'm getting this error in my Python code:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe4 in position 5884: invalid continuation byte

The script is for a dictionary attack using the Crackstation dictionary. I'm trying to make this for fun, but there's a problem when I try to iterate through the items in the dictionary.

pass_file = open(pass_doc, 'r')

for word in pass_file:

pass_doc is a .txt file, NOT .csv. Does it have to be .csv?

I've tried using load_text() instead of open(), but all I want is a simple list of items. The code should iterate over every item in the dictionary, stored in a list, and I don't really know what's wrong.

You need to pass an `encoding=` parameter to the `open()` call that matches the actual encoding of the file. (You haven't supplied enough information for us to tell what that encoding might be.) — jasonharper, Apr 17 '23 at 16:58
`utf-8` is the default encoding method assumed by `open`; your file is *not* UTF-8-encoded. — chepner, Apr 17 '23 at 17:09
Are there any characters other than normal U.S. Keyboard chars? — Garlic Bread Express, Jun 27 '23 at 20:07
Thanks everyone, I just came back to the project and simply saved a UTF-8 version. When I first tried this I had no idea that it wasn't already! — jedd, Aug 29 '23 at 18:52
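
A minimal sketch of what the comments above suggest: pass an explicit `encoding=` to `open()`. It assumes the file is Latin-1 (ISO-8859-1), which is only a guess. Byte 0xe4 is "ä" in Latin-1 and Windows-1252, but the question does not state the real encoding. The helper name `iter_words` and the demo file are hypothetical.

```python
import os
import tempfile


def iter_words(path, encoding="latin-1"):
    """Yield one word per line, decoding with an explicit encoding.

    "latin-1" is an assumed default; replace it with the file's real encoding.
    """
    with open(path, "r", encoding=encoding) as pass_file:
        for line in pass_file:
            # Drop the trailing newline but keep any other whitespace.
            yield line.rstrip("\r\n")


# Demo on a small hypothetical file containing the byte 0xe4.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "words.txt")
    with open(demo_path, "wb") as f:
        f.write(b"password\nk\xe4se\n")
    words = list(iter_words(demo_path))

print(words)
```

Note that Latin-1 maps every possible byte to a character, so decoding with it never raises. If the guess is wrong, the result is silently wrong characters rather than an error.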

Garlic Bread Express · Answer 1 · 2023-04-18T01:08:03.850

-2

Save your text file as UTF-8. If you want to keep the current encoding, try this:

import codecs
BLOCKSIZE = 1048576 # or some other, desired size in bytes
with codecs.open(sourceFileName, "r", "your-source-encoding") as sourceFile:
    with codecs.open(targetFileName, "w", "utf-8") as targetFile:
        while True:
            contents = sourceFile.read(BLOCKSIZE)
            if not contents:
                break
            targetFile.write(contents)

This question might also help: UnicodeDecodeError: 'utf8' codec can't decode byte 0x9c

edited Apr 18 '23 at 01:08

answered Apr 17 '23 at 16:58

Garlic Bread Express


If you think my answer isn't very good, please suggest a way that I can improve it. – Garlic Bread Express Apr 17 '23 at 17:13
There isn't really a way to improve this – it's just wrong. `open(foo)` is exactly the same as `open(foo, 'r')`, so this isn't a solution. On top of an obvious syntax error the code suggests an anti-pattern – `data = f.readlines()` *needlessly* pulls the entire file into memory and keeps it there. – MisterMiyagi Apr 17 '23 at 17:58
This code in no way fixes the problem. It's just another way of doing the same wrong thing. – Frank Yellin Apr 17 '23 at 18:01
Alright, I have changed the code in a way that it can help. – Garlic Bread Express Apr 18 '23 at 01:08
Why `codecs.open()` instead of the regular `open()`? – John Gordon Apr 18 '23 at 01:14
@John Gordon, see this: https://stackoverflow.com/questions/5250744/difference-between-open-and-codecs-open-in-python – Garlic Bread Express Apr 18 '23 at 01:27
That Q&A is rather outdated and focuses on Python 2 but still states multiple times that in Python 3, the plain `open` is appropriate. – MisterMiyagi Apr 18 '23 at 16:28
I prefer to use that in case someone is using Python 2. – Garlic Bread Express Apr 19 '23 at 01:20
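
For Python 3, the same chunked conversion can be written with the built-in `open()`, as the comments above point out. This is a sketch under stated assumptions: the function name `convert_to_utf8` is hypothetical, and "latin-1" in the demo stands in for whatever the source encoding really is.

```python
import os
import tempfile


def convert_to_utf8(source_path, target_path, source_encoding, blocksize=1048576):
    """Re-encode a text file to UTF-8 without loading it all into memory."""
    # newline="" keeps the original line endings untouched on both sides.
    with open(source_path, "r", encoding=source_encoding, newline="") as src, \
         open(target_path, "w", encoding="utf-8", newline="") as dst:
        while True:
            chunk = src.read(blocksize)  # reads up to blocksize characters
            if not chunk:
                break
            dst.write(chunk)


# Demo with a hypothetical Latin-1 file containing the byte 0xe4.
with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, "latin1.txt")
    target = os.path.join(tmp, "utf8.txt")
    with open(source, "wb") as f:
        f.write(b"k\xe4se\r\n")
    convert_to_utf8(source, target, "latin-1")
    with open(target, "rb") as f:
        converted = f.read()

print(converted)
```

After conversion, a plain `open(target, encoding="utf-8")` can iterate over the lines without raising `UnicodeDecodeError`, provided the source encoding was named correctly.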
