1

I have an unsanitized data source that I don't control and is hundreds of GB in size with fixed width columns. When I run

LOAD DATA LOCAL INFILE 
'B:\\path\\data.txt' IGNORE
INTO TABLE `bigtable`
(@row)
SET
field1      = TRIM(SUBSTR(@row,1,12)),
field2      = TRIM(SUBSTR(@row,13,192)),
field3      = TRIM(SUBSTR(@row,205,1)),
90+ more row.....

it won't import a single record on account of Invalid utf8 character string: 'L15000156741CHAMPION FITNESS, LLC

There doesn't look to be anything wrong with the string. There are no special characters. ASCII and utf8 overlap in this range right? This LOAD DATA statement would work in previous versions of MySQL but in version 5.7.19 it's not working and it's driving me up a wall. I don't need to keep the broken rows. None of the other answers have solved the situation for me including here and here.

Is there some way to simply skip the broken rows? Maybe a setting or statement I'm forgetting? I thought IGNORE should have fixed this but it still won't load.

Altimus Prime
  • 2,207
  • 2
  • 27
  • 46
  • 1
    It's possible that one of the characters look like an ASCII character but isn't. I would recommend creating a small script to go through each character one at a time and display any characters that do not fall within the standard ASCII range (or standard alphabetic range). Make sure your script, text file, and functions you use are designed to handle the full UTF-8 character set. (As a side note, you may also be able to narrow down where a problem character is by manually splitting the file into smaller and smaller chunks to track it down). – kojow7 Jun 28 '19 at 01:12
  • Please tell us the encoding of the text file and the encoding of your MySQL database. – Tim Biegeleisen Jun 28 '19 at 01:12
  • 1
    Is `LOAD DATA` returning a line number where the problem actually happened? If so, then you may use that to figure out the offending row, then amend that row – Tim Biegeleisen Jun 28 '19 at 01:19
  • Thanks for commenting. I have sed going over the files right now to sanitize them but it'll be a while. What I hoped to document here for all eternity is how to ignore the rows the way the way previous versions of MySQL would allow us to do. This right now took far longer than it should have and it still isn't done. – Altimus Prime Jun 28 '19 at 01:19
  • You also said "there are no special characters". While ignoring them could work (which I'm not familiar with how to do) I think it would be a bit more important to figure out what is causing it to fail in the first place. Just my opinion though. – kojow7 Jun 28 '19 at 01:22
  • Before asking the question I copied the file to my linux system and it's still processing. If I could have simply skipped the offending rows it would save so much time and given that previous MySQL versions seemed to allow it. I spent hours mapping the fixed width file to a table create statement followed by the insert statement. – Altimus Prime Jun 28 '19 at 01:30

0 Answers0