0

I am having essentially the same problem as described here but the issue was left unresolved in that question.

I am trying to import a series of data files totaling about 100 million records into a MariaDB database. I've run into issues with some lines in the import file that look like:

"GAYATRI INC DBA "WHIPIN"","1950","S I","","AUSTIN","TX","78704","5124425337","B","93"

which I was trying to load with a statement like:

LOAD DATA INFILE 'testline.txt'
INTO TABLE data
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
(@name,@housenum,@street,@aptnum,@city,@state,@zip,@phone,@business,@year)
SET name=@name, housenum=@housenum, street=@street, aptnum=@aptnum, city=@city, state=@state, zip=@zip, phone=@phone, business=@business, year=@year;

but am receiving errors because the first field contains unescaped double quotes in the text of the field. That seems to be OK in and of itself as the database seems smart enough to handle that in most situations. However, because the field ends with a double quote in the text plus a double quote to close the field it assumes the first double quote is escaping the second double quote following RFC4180 and thus is not terminating the field even though the next character is a comma.

The source files can't be created any differently as they are exports from old software which I do not control. Obviously searching through 100 million records and changing entries like this by hand is not feasible. I'm unsure of whether any fields might contain commas though it's probably safe to assume they do in this quantity of records so programmatically forcing fields to break at commas is probably out too.

Any ideas on how to get them to import correctly?

Ben Franske
  • 330
  • 2
  • 11
  • Does adding `OPTIONALLY` before the `ENCLOSED BY` make any difference? – Paul T. Jul 22 '20 at 02:38
  • No, same result. The first field is imported as: `GAYATRI INC DBA "WHIPIN","1950` and the second as `S I` etc. with the last field (which should import as 93) being null because two fields are bring included in the first field. – Ben Franske Jul 22 '20 at 04:47
  • Time to learn a programming language that includes a robust REGEXP processor. – Rick James Jul 22 '20 at 05:40
  • I thought that might be the case and am familiar enough with a few languages to do CSV regex processing but am struggling a little with two things related to that. 1) How to split things with a regex without running into the same issue that the database is, i.e. what rules will be needed and 2) Then how to format the re-written file so that it can be imported into the database. I suppose I could change the field enclosures to a character very unlikely to occur in the dataset maybe like a tilde... – Ben Franske Jul 22 '20 at 16:11
  • Were you originally getting a 1064 error? ... looking at a user suggestion at the maridDB docs, see if moving the `ENCLOSED BY` before the terminated clauses might help: `ENCLOSED BY '"' FIELDS TERMINATED BY ','`. If that still doesn't work, I have another idea. – Paul T. Jul 23 '20 at 00:17
  • I had to make it `FIELDS ENCLOSED BY '"' TERMINATED BY ','` as you need the FIELDS before ENCLOSED. However, the result was the same. – Ben Franske Jul 23 '20 at 18:31

0 Answers0