I have a large .csv that I'm trying to import into a MySQL database for a Django project. I'm using django.db to execute raw SQL statements such as:

LOAD DATA LOCAL INFILE 'file.csv'...

However, I keep getting the following error:

django.db.utils.OperationalError: (1300, "Hey! Are you out tonight?😜")

After grepping the .csv for the line, I realised that the error is being caused by this character: 😜, though I'm sure other characters will throw the same error after I fix this one.

Running:

$ file --mime file.csv

from a terminal, returns:

file.csv: text/html; charset=us-ascii

Since the rest of my db is in UTF-8, I tried writing a Python script to re-encode it, using .encode('utf-8', 'ignore') in the hope that the 'ignore' would drop any symbols that gave it trouble, but it threw:

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 825410: invalid continuation byte

The thing is, I don't actually care about inserting 100% of the file into my db. I would rather just insert only the 'safe' lines that don't contain strange characters.

So ideally, I'm looking for a way to modify my LOAD DATA LOCAL INFILE sql statement so it just skips inserting any lines that give it trouble. This is optimal, since I don't want to spend time preprocessing the data.
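For context, here is the rough shape of what I'm executing from Django. The table name `content` and the field delimiters are placeholders; MySQL's IGNORE modifier (which downgrades many per-row errors to warnings so the load continues) is the kind of behaviour I'm hoping to use, though I'm not sure it covers encoding errors:

```python
# Sketch of the raw LOAD DATA call from Django. Table name `content` and the
# delimiter clauses are illustrative, not my real schema.
sql = """
LOAD DATA LOCAL INFILE 'file.csv'
IGNORE INTO TABLE content
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\\n'
"""

def load_csv():
    from django.db import connection  # imported here so the SQL above stays importable
    with connection.cursor() as cursor:
        cursor.execute(sql)
        cursor.execute("SHOW WARNINGS")  # inspect which rows MySQL complained about
        return cursor.fetchall()
```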

If that isn't feasible, the next best thing is to remove any troublesome characters/lines with a Python script that I could later run from my Django app whenever I update my db.

If all else fails, information on how to grep out any characters that aren't UTF-8 friendly that I could write a shell script around would be useful.
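In case it helps, this is the rough shape of the Python filter I have in mind for option two: read the file as raw bytes and keep only the lines that decode cleanly as UTF-8 (filenames below are placeholders):

```python
def filter_utf8_lines(src_bytes):
    """Return only the lines of src_bytes that are valid UTF-8."""
    good = []
    for line in src_bytes.splitlines(keepends=True):
        try:
            line.decode('utf-8')  # validity check only; we keep the raw bytes
            good.append(line)
        except UnicodeDecodeError:
            pass  # drop any line containing invalid byte sequences
    return b''.join(good)

# Usage against a file on disk (names are illustrative):
# with open('file.csv', 'rb') as f:
#     cleaned = filter_utf8_lines(f.read())
# with open('file_clean.csv', 'wb') as f:
#     f.write(cleaned)
```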

Sam

2 Answers

For 😜, MySQL must use CHARACTER SET utf8mb4 on the column where you will be storing it, in the LOAD DATA statement, and on the connection.

More Python notes: http://mysql.rjweb.org/doc.php/charcoll#python

E9 does not make sense there. The hex for the UTF-8 encoding of 😜 is F09F989C.

The link on converting between character sets is irrelevant; only UTF-8 can be used for Emoji.

Rick James
  • Wow, thanks for the link. Very informative and you are obviously very knowledgeable. Just to clarify your first sentence, I'm loading the csv with `LOAD DATA LOCAL INFILE file.csv IGNORE INTO TABLE content CHARACTER SET utf8mb4 ...` which I believe covers the first and second parts. Then my settings.py has the following for the DATABASE section: `'ENGINE': 'django.db.backends.mysql', ... 'OPTIONS': { 'charset': 'utf8mb4', 'use_unicode': True, }` Which I believe handles the connection part you were talking about. Is this correct, or is there something else I should be doing? – Sam Jul 06 '17 at 04:59
  • Hehe -- I'm faking it. I have never coded in Python. That link is a compilation of apparent solutions to character problems in Python and about 35 other languages, most of which I have not touched. So, I can't really answer your last question; let us know if it works. (Let me know if you find an error in that link; I will update it.) – Rick James Jul 06 '17 at 15:49
  • Haha, nice. Following the Django part of your blog got me further, but unfortunately it failed later in the csv on some Chinese characters. I ended up changing my `LOAD DATA` statement to use latin1 for the character set. I'm sure this will come back to bite me, but it worked to get the whole csv into my db. I have no idea why latin1 would work where utf8mb4 didn't though. – Sam Jul 07 '17 at 09:37
  • latin1 accepts all 8-bit values. utf8mb4 checks for validity. latin1 will bite you when you get to things like `CHAR_LENGTH()`, string comparisons, `ORDER BY`, etc., since it is not seeing the text correctly. Admittedly, it may not bite hard. – Rick James Jul 07 '17 at 15:55
Not 100% sure if this will help but this is what I'd try:

Since open() is used to open a CSV file for reading, the file will by default be decoded into unicode using the system default encoding (see locale.getpreferredencoding()). To decode a file using a different encoding, use the encoding argument of open:

import csv
with open('some.csv', newline='', encoding='utf-8') as f:
    reader = csv.reader(f)
    for row in reader:
        print(row)

That's an example taken from the official docs. Bear in mind that you might need to replace utf-8 with the file's actual encoding, as the docs say. Then you can either continue using Python to push your data into the DB or write a new file with the correct encoding.
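If you don't know the file's real encoding, the errors argument of decode (or of open) can also substitute U+FFFD for the invalid bytes instead of raising. A minimal sketch, with a hypothetical helper name:

```python
import csv
import io

def rows_with_replacement(raw_bytes):
    # Decode as UTF-8, substituting U+FFFD for any invalid bytes instead of
    # raising UnicodeDecodeError, then parse the result as CSV.
    text = raw_bytes.decode('utf-8', errors='replace')
    return list(csv.reader(io.StringIO(text)))
```

You could pass errors='ignore' instead to drop the bad bytes entirely, at the cost of silently losing data.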

Alternatively, this could be another approach.

chachan