
I have a CSV file that I'm uploading via an HTML form to a Python API.

The API looks like this:

from flask import Flask, request
import csv, io

app = Flask(__name__)

@app.route('/add_candidates_to_db', methods=['GET', 'POST'])
def add_candidates():
    file = request.files['csv_file']
    # This line raises the UnicodeDecodeError:
    x = io.StringIO(file.read().decode('UTF8'), newline=None)
    csv_input = csv.reader(x)
    for row in csv_input:
        print(row)
    return 'OK'

I found the part of the file that causes the issue: it contains the Í character.

I get this error: UnicodeDecodeError: 'utf8' codec can't decode byte 0xea in position 1317: invalid continuation byte

I thought I was decoding it with .decode('UTF8'). Or is the error happening before that, with file.read()?

How do I fix this?


Edit: I have control of the file. I am creating the CSV file myself by pulling data (sometimes this data has strange characters).

On the server side, I'm reading each row of the file and inserting it into a database.

Morgan Allen
  • It's telling you it is not valid utf8. To fix it, use valid utf8. – pvg Dec 21 '16 at 21:35
  • ahh, it's saying I can't decode it into utf8 because it's not a valid utf8 character? – Morgan Allen Dec 21 '16 at 21:38
  • Right. For questions like this, posting the version of Python is useful as well. This looks like python 2. – pvg Dec 21 '16 at 21:41
  • yes, this is version 2. When I try this in the interpreter, it works `'VÍctor'.decode('utf8')` returns `u'V\xcdctor'` so why does it break in the script? – Morgan Allen Dec 21 '16 at 21:44
  • I have no idea, have you looked at your file? The error is essentially 'file is not utf8 encoded' which probably means the file is not actually utf8 encoded. – pvg Dec 21 '16 at 21:45
  • If I take out the row that has the special character, it's fine. The file is a CSV file that I've uploaded. How do I make the file utf8 encoded? – Morgan Allen Dec 21 '16 at 21:46
  • Opening it in a text editor and saving it as utf8 is one simple way. Assuming the editor is correctly guessing the encoding to begin with. – pvg Dec 21 '16 at 21:49
  • The actual right answer probably depends on you describing what you're trying to accomplish and what the inputs and outputs are. For instance, a lot of these issues are easily avoidable if it's possible for the input to be text someone pastes or types into a form rather than a file. With a file, you don't have any guarantees of the encoding. – pvg Dec 21 '16 at 21:51
  • Just updated my question! – Morgan Allen Dec 21 '16 at 21:53
  • Ok, that is basically a different question. You have to either know or guess (programmatically) the encoding of the data and convert it to utf8. The data isn't utf8, and the only way to make it utf8 is to know what encoding it's in to begin with and convert (see the sketch after these comments). – pvg Dec 21 '16 at 21:59
  • What program or technique are you using to create the CSV file yourself? – Robᵩ Dec 21 '16 at 22:01
  • Can you read the file using `from codecs import open; open(filename, encoding='utf-8').read()` ? – jmd_dk Dec 21 '16 at 22:01
  • @jmd_dk error : `TypeError: coercing to Unicode: need string or buffer, FileStorage found` – Morgan Allen Dec 21 '16 at 22:04
  • flagging this as a dupe now that the question is clearer – pvg Dec 21 '16 at 22:34
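A minimal Python 2 sketch of the "know or guess, then convert" approach pvg describes above. The Latin-1 fallback is only an assumption for illustration; nothing in the thread confirms the file's actual encoding:

def to_unicode(raw_bytes):
    """Decode uploaded bytes, preferring UTF-8, with a guessed fallback."""
    try:
        return raw_bytes.decode('utf-8')
    except UnicodeDecodeError:
        # Assumption: non-UTF-8 uploads are Latin-1, where every byte
        # (including 0xEA from the traceback) is valid. Latin-1 never
        # fails, so always eyeball the result for mojibake.
        return raw_bytes.decode('latin-1')

print repr(to_unicode('V\xc3\x8dctor'))  # valid UTF-8      -> u'V\xcdctor'
print repr(to_unicode('V\xcdctor'))      # Latin-1 fallback -> u'V\xcdctor'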

2 Answers


Your data is not UTF-8; it contains invalid byte sequences. You say that you are generating the data yourself, so the ideal solution is to generate valid UTF-8 to begin with.

Unfortunately, sometimes we are unable to get high-quality data, or we have servers that give us garbage and we have to sort it out. For these situations, we can use less strict error handling when decoding text.

Instead of:

file.read().decode('UTF8')

You can use:

file.read().decode('UTF8', 'replace')

This will make it so that any “garbage” bytes (anything which is not correctly encoded as UTF-8) get replaced with U+FFFD, the Unicode replacement character, which looks like this: �
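For illustration, here is the behavior in a Python 2 session (the byte string is a made-up example, not the asker's actual data):

data = 'V\xcdctor'                          # 0xCD expects a continuation byte, but 'c' is not one
print repr(data.decode('UTF8', 'replace'))  # u'V\ufffdctor' -- the bad byte becomes U+FFFD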

You say that your file has the Í character, but you are probably viewing the file using an encoding other than UTF-8. Is your file supposed to contain Í, or is it just mojibake? Maybe you can figure out what the character is supposed to be, and from that, you can figure out what encoding your data uses if it's not UTF-8.
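One way to test that theory, assuming (only as a guess) that the file is really Latin-1: decode the suspicious bytes with the candidate encoding and see whether plausible letters come out.

print repr('\xcd'.decode('latin-1'))  # u'\xcd' -> Í, the character the asker sees
print repr('\xea'.decode('latin-1'))  # u'\xea' -> ê, the byte from the traceback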

Dietrich Epp
  • This is not a great answer without at least pointing out that there are tools out there that attempt to determine character encoding and they are often quite effective. As it stands, your answer is basically 'trash the data'. This isn't really necessary. – pvg Dec 21 '16 at 22:50
  • @pvg: That's a great contribution. Please feel free to hit the "edit" button. – Dietrich Epp Dec 21 '16 at 22:53

It seems that your file is not encoded in utf8. You can try reading the file with every encoding that Python understands and see which ones let you read the entire content of the file. Try this script:

import codecs

encodings = [
    "ascii",
    "big5",
    "big5hkscs",
    "cp037",
    "cp424",
    "cp437",
    "cp500",
    "cp720",
    "cp737",
    "cp775",
    "cp850",
    "cp852",
    "cp855",
    "cp856",
    "cp857",
    "cp858",
    "cp860",
    "cp861",
    "cp862",
    "cp863",
    "cp864",
    "cp865",
    "cp866",
    "cp869",
    "cp874",
    "cp875",
    "cp932",
    "cp949",
    "cp950",
    "cp1006",
    "cp1026",
    "cp1140",
    "cp1250",
    "cp1251",
    "cp1252",
    "cp1253",
    "cp1254",
    "cp1255",
    "cp1256",
    "cp1257",
    "cp1258",
    "euc_jp",
    "euc_jis_2004",
    "euc_jisx0213",
    "euc_kr",
    "gb2312",
    "gbk",
    "gb18030",
    "hz",
    "iso2022_jp",
    "iso2022_jp_1",
    "iso2022_jp_2",
    "iso2022_jp_2004",
    "iso2022_jp_3",
    "iso2022_jp_ext",
    "iso2022_kr",
    "latin_1",
    "iso8859_2",
    "iso8859_3",
    "iso8859_4",
    "iso8859_5",
    "iso8859_6",
    "iso8859_7",
    "iso8859_8",
    "iso8859_9",
    "iso8859_10",
    "iso8859_13",
    "iso8859_14",
    "iso8859_15",
    "iso8859_16",
    "johab",
    "koi8_r",
    "koi8_u",
    "mac_cyrillic",
    "mac_greek",
    "mac_iceland",
    "mac_latin2",
    "mac_roman",
    "mac_turkish",
    "ptcp154",
    "shift_jis",
    "shift_jis_2004",
    "shift_jisx0213",
    "utf_32",
    "utf_32_be",
    "utf_32_le",
    "utf_16",
    "utf_16_be",
    "utf_16_le",
    "utf_7",
    "utf_8",
    "utf_8_sig",
]

for encoding in encodings:
    try:
        with codecs.open(filename, encoding=encoding) as f:
            f.read()
        print('Seemingly working encoding: {}'.format(encoding))
    except (UnicodeError, LookupError):
        # This encoding could not decode the whole file; try the next one.
        pass

where filename is the path to your file.

jmd_dk
  • All of the single-byte encodings (which I believe includes all `cp*` and `iso8859*` encodings) will be able to read the file without error, but the user will still have to examine the results to check whether the file was decoded to the correct characters. – jwodder Dec 21 '16 at 22:25
  • This is not a sensible answer. There are better ways to attempt encoding detection. – pvg Dec 21 '16 at 22:33
  • Hence the wording "seemingly working". The script finds encodings which do not throw an error. No guarantee is made that a given encoding actually does the job correctly. – jmd_dk Dec 21 '16 at 22:34
  • Having the code print 'this is a useless answer' doesn't make the answer less useless. Chardet and other tools exist that make a much better attempt at this (see the sketch below). Additionally, for all we know, the encoding might be included in wherever the poster is getting data from. Or defined. – pvg Dec 21 '16 at 22:44
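To make pvg's chardet suggestion concrete, here is a short sketch. It requires pip install chardet, and the filename is hypothetical:

import chardet

with open('candidates.csv', 'rb') as f:  # hypothetical filename
    raw = f.read()

guess = chardet.detect(raw)  # e.g. {'encoding': 'ISO-8859-1', 'confidence': 0.73}
if guess['encoding'] is not None:
    text = raw.decode(guess['encoding'])  # unicode object
    utf8_bytes = text.encode('utf-8')     # re-encode as genuine UTF-8

chardet only guesses, so the confidence value is worth checking before trusting the result.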