I have a (frustrating) mystery I need help solving. I have a MySQL table with a column storing URLs, specifically URLs to tracks on last.fm. I thought I made sure to store them in percent-encoded (URL-escaped) form (e.g. "ünloco" becomes "%C3%BCnloco"), since the table uses the latin1 character set (not utf-8), but something strange is going on.
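(For reference, the escaping was done roughly like this; a minimal Python 2 sketch, not my actual insertion code:)

    import urllib

    # "ünloco" as UTF-8 bytes; urllib.quote() percent-encodes every byte
    # outside the unreserved ASCII set, so only ASCII should hit the table
    artist = u'\xfcnloco'.encode('utf-8')
    print urllib.quote(artist)  # -> %C3%BCnloco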
I backed up the table with mysqldump, truncated the table, and then tried to reimport the same data using the standard LOAD DATA INFILE syntax.
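(For context, the round trip looked roughly like this; the database, table, and file names are placeholders, not my real ones:)

    -- dump to a tab-delimited text file (run from the shell):
    --   mysqldump --tab=/tmp/dump mydb tracks
    TRUNCATE TABLE tracks;
    -- reload it; this is the step that stops early:
    LOAD DATA INFILE '/tmp/dump/tracks.txt' INTO TABLE tracks
        FIELDS TERMINATED BY '\t' LINES TERMINATED BY '\n';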
But it would stop after reading only ~300k of the ~40 million rows (it's not a line-termination issue...already checked). No errors, no warnings...it was as if MySQL thought the file was shorter than it really was.
After reading the file into Python, it turns out it was failing on a row containing a raw non-ASCII byte (e.g. \xa0, which is a non-breaking space in latin1). That is, it would read in all rows up until it hit that one. I verified this was the problem by removing that row and trying again, upon which it loaded data up to the next line containing such a byte, and stopped.
So these bytes are definitely the problem, but I have no idea how they ended up in the table in the first place (shouldn't the latin1 collation disallow them?). More importantly, how do I get the data back into the database now?
Do I have to manually edit the dumped text file to remove or fix these lines, or is there a way to make MySQL read the file in correctly? In the former case, how can I efficiently find the problematic lines? Command-line or Python solutions appreciated...there are far too many lines to examine by hand.
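(To be concrete about what "find the problematic lines" means: something like this Python 2 scan, with filename as a placeholder, would report every line containing a byte outside the ASCII range:)

    filename = 'tracks.txt'  # placeholder
    for i, line in enumerate(open(filename, 'rb')):
        # flag any byte >= 0x80, i.e. anything outside 7-bit ASCII
        if any(ord(c) > 0x7f for c in line):
            print '%d: %r' % (i, line)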
Edit: to include an example of the problematic lines, the first of which I tracked down like this:
    for i, line in enumerate(open(filename)):
        if i == 350616:
            break
    print line

which prints:

    350317 2 sevish 346067 \N á sevish/_/á

and, with repr:

    print repr(line)
    '350317\t2\tsevish\t346067\t\\N\t\xa0\tsevish/_/\xa0\n'