I have a large .csv file that I'm trying to import into a MySQL database for a Django project. I'm using the django.db module to write raw SQL statements such as:
LOAD DATA LOCAL INFILE 'file.csv'...
However, I keep getting the following error:
django.db.utils.OperationalError: (1300, "Invalid utf8 character string: 'Hey! Are you out tonight?'")
After grepping the .csv for the quoted line, I realised that the error is being caused by this character: ;. I'm sure other characters will throw the same error once I fix this one, though.
Running:
$ file --mime file.csv
from a terminal, returns:
file.csv: text/html; charset=us-ascii
Since the rest of my db is in UTF-8, I tried writing a Python script to re-encode the file using .encode('utf-8', 'ignore'),
hoping that 'ignore' would strip any symbols that gave it trouble, but it threw:
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe9 in position 825410: invalid continuation byte
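Poking at this in a REPL afterwards, I think the 'ignore' never takes effect because the failure happens while decoding the file into text, before .encode() is ever called. Passing the handler to decode instead does drop the bad bytes; the b'caf\xe9' sample below is just my guess at what a stray latin-1 'é' looks like in my data:

```python
# 0xe9 is 'é' in latin-1, but on its own it is not valid UTF-8.
raw = b'caf\xe9'

try:
    raw.decode('utf-8')
except UnicodeDecodeError as exc:
    print(exc)  # same kind of error my script hits

# The error handler belongs on the *decode* step, not encode:
cleaned = raw.decode('utf-8', errors='ignore')
print(cleaned)  # 'caf'
```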
The thing is, I don't actually care about inserting 100% of the file into my db. I would rather insert only the 'safe' lines that don't contain strange characters.
So ideally, I'm looking for a way to modify my LOAD DATA LOCAL INFILE
SQL statement so that it simply skips inserting any lines that give it trouble. That would be the best outcome, since I don't want to spend time preprocessing the data.
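For what it's worth, the closest thing I've spotted in the MySQL docs is the IGNORE keyword (the table name and delimiters below are placeholders for mine). With LOCAL, data-interpretation errors such as 1300 are demoted to warnings, but if I understand correctly the offending value is truncated at the bad byte rather than the whole line being skipped, which isn't quite what I want:

```sql
LOAD DATA LOCAL INFILE 'file.csv'
IGNORE INTO TABLE mytable
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- lists which rows were affected and why
SHOW WARNINGS;
```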
If that isn't feasible, the next best thing is to strip any troublesome characters or lines with a Python script that I could run from my Django app whenever I update my db.
If all else fails, pointers on how to grep out any characters that aren't valid UTF-8, which I could wrap in a shell script, would be useful.
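For the shell route, these are the two standard tools I've turned up so far, shown against a throwaway sample file of my own making: iconv with -c drops invalid bytes one by one, while GNU grep with -a and -x (in a UTF-8 locale) keeps only lines that are wholly valid:

```shell
# Sample input: the second line contains a stray latin-1 0xe9 byte.
printf 'good,1\nbad caf\351,2\nalso good,3\n' > sample.csv

# Locale-independent: re-validate as UTF-8, silently dropping bad bytes.
iconv -f UTF-8 -t UTF-8 -c sample.csv > clean.csv

# GNU grep alternative: keep only lines that are entirely valid UTF-8.
LC_ALL=C.UTF-8 grep -ax '.*' sample.csv > clean_lines.csv || true
```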