
I have a (frustrating) mystery I need help solving. I have a MySQL table with a column storing URLs, specifically URLs to tracks on last.fm. I thought I made sure to store them in percent-encoded (URL-escaped) form (e.g. "ünloco" becomes "%C3%BCnloco"), since I'm using a latin1 collation (not utf-8), but something strange is going on.

I backed up the table with mysqldump, truncated the table, and then tried to reimport the same data using the standard "load data infile" syntax.

But it would stop after reading in ~300k of ~40 million rows (it's not a line-termination issue...I already checked). No errors, no warnings...it was as if MySQL thought the file was shorter than it really was.

After reading the file into Python, it turned out it was failing on a row that had a non-ASCII byte in it (e.g. \xa0). That is, it would read in all rows up until it hit that one. I verified this was the problem by removing that row and trying again, upon which it loaded data up to the next line containing such a byte, and stopped.

So these bytes are definitely the problem, but I have no idea how they ended up in the table (shouldn't the latin1 collation disallow that?). More importantly, how do I get the data back into the database now?

Do I have to manually edit the dumped text file and remove/edit these lines, or is there a way to make MySQL read the file in correctly? In the former case, how can I efficiently find the problematic lines? (Command-line or Python solutions appreciated...there are far too many lines to examine by hand.)
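
For reference, this is the kind of scan I have in mind (just a rough sketch; 'dump.txt' stands in for the real dump file), in case there's a smarter way:

# Rough sketch: print the line number and repr of every line that
# contains a byte outside the 7-bit ASCII range.
for lineno, line in enumerate(open('dump.txt', 'rb'), 1):
    if any(ord(c) > 127 for c in line):
        print lineno, repr(line)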


Edit: here's an example of the problematic lines, the first of which I've found is:

for i, line in enumerate(open(filename)):
    if i == 350616:
        break

print line
# 350317  2       sevish  346067  \N      á       sevish/_/á

print repr(line)
# '350317\t2\tsevish\t346067\t\\N\t\xa0\tsevish/_/\xa0\n'
moustachio
  • See also http://stackoverflow.com/questions/752998/how-to-work-with-unicode-in-python and please track down the source or use of the \xa0 as it might be significant. Maybe you're storing special quote marks or something. – leewz Feb 03 '14 at 08:17
  • After digging into some old database backups, the problem is definitely some special characters somehow ending up in the database (I still have no idea how!). I can fix them, but still need a way to identify which lines contain such characters... – moustachio Feb 03 '14 at 18:06

2 Answers


(I assume that \xa0 is a single byte in the file, and that you're on Windows.)

Open in binary mode.

f = open(fname, 'rb')

Otherwise the file is opened in text mode, which on Windows translates line endings and treats a stray \x1a (Ctrl-Z) byte as end-of-file, so a read can stop early even though no error is raised.

Note: in Python 3, read operations on a binary-mode file return bytes objects instead of str; you'd have to decode them explicitly, and decoding as ASCII will still fail on bytes like \xa0. In Python 2 they are plain byte strings either way.

In binary mode you can still iterate over the file, or use readline or readlines: lines are split on raw \n bytes with no decoding, so non-ASCII bytes pass through untouched.
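
For example, a minimal sketch (Python 2, as in the question; fname is the same placeholder as above):

# Sketch: scan the dump in binary mode so every byte, including
# \xa0, survives intact.
for line in open(fname, 'rb'):
    if '\xa0' in line:
        print repr(line)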

leewz

OK, so I finally figured out a fairly simple way to find the problematic lines:

for line in open(filename):
    try:
        line.decode('ascii')       # fails on any byte outside ASCII
    except UnicodeDecodeError:     # a bare except would hide other errors
        print line

Basically, I try to decode each line as ASCII, and if that fails, I print the line. This pulled out the relevant lines and allowed me to edit them so MySQL could read them in properly.
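
If you'd rather fix the lines in bulk instead of by hand, the same idea extends to writing a cleaned copy (a sketch only; 'dump.txt' and 'dump_clean.txt' are placeholder names, and simply dropping the offending bytes may or may not be the right fix for your data):

# Sketch: copy the dump, stripping bytes >= \x80 from any line that
# fails ASCII decoding ('dump.txt'/'dump_clean.txt' are placeholders).
out = open('dump_clean.txt', 'wb')
for line in open('dump.txt', 'rb'):
    try:
        line.decode('ascii')
        out.write(line)
    except UnicodeDecodeError:
        out.write(''.join(c for c in line if ord(c) < 128))
out.close()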

This of course doesn't address how these values got there in the first place, but at least allowed me to fix the problem.

moustachio