
I am loading a CSV file into an empty table, successfully and with no errors. But when I compare the number of original rows (from viewing the .csv in an external application, and from the Output Response) with the number of rows in my resulting table (from Table Inspector), it seems that not all rows are importing: Table Inspector reports that the table has 1,416,824 rows, while the original CSV has 1,419,910 rows. There should be no duplicated primary keys in the data set, and even if there were, I would have expected the import to error out on those lines.

Table structure:

CREATE TABLE `table1` (
  `pkfield` varchar(10) NOT NULL,
  `field1` varchar(3) DEFAULT NULL,
  `field2` varchar(1) DEFAULT NULL,
  `field3` varchar(1) DEFAULT NULL,
  PRIMARY KEY (`pkfield`),
  UNIQUE KEY `pkfield_UNIQUE` (`pkfield`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

Import command:

LOAD DATA INFILE 'c:/table1.csv'
INTO TABLE myschema.table1
FIELDS TERMINATED BY ','
ENCLOSED BY '"'
LINES TERMINATED BY '\r\n';

MySQL Workbench Response:

1419910 row(s) affected Records: 1419910  Deleted: 0  Skipped: 0  Warnings: 0

Sample from csv file (data mocked up):

6623950258,XYZ,A,B
5377103432,XYZ,A,B
9131144416,XYZ,A,A
1326703267,XYZ,A,B
7847786312,XYZ,A,B
1119927042,XYZ,B,B
4144055385,CDE,A,B
4563489252,XYZ,A,B
5733611912,XYZ,A,B
3309418377,XYZ,A,B
6928148128,XYZ,A,B
1152657670,XYZ,A,B
8143082292,CDE,A,B
9373340750,XYZ,A,A
3318949288,XYZ,A,B
1166427684,XYZ,A,B
5062296807,XYZ,B,A
4624323293,XYZ,A,B
3088992643,CDE,A,B
6477504847,XYZ,A,B

Any suggestions or explanations would be greatly appreciated. Kind regards.

nbayly
  • The only idea I have is to `SELECT INTO OUTFILE` and then `diff` the two .csv files. – Darwin von Corax Mar 16 '16 at 21:41
  • How would I diff the two files though? The whole reason I moved this into MySQL was the size of the files. How can I make a line-by-line comparison to determine which rows aren't making it into my import? – nbayly Mar 16 '16 at 22:08
  • Excel could manage that amount of data, provided you have the 64-bit version and a good amount of memory. Consider that you could have some incorrectly formatted rows. – White Feather Mar 16 '16 at 22:12
  • On *nix it would be something like `diff <(sort table1.csv) <(sort table1.out.csv)`. I'd missed the fact that you're on Windows, though, and I've no idea how to do it there. – Darwin von Corax Mar 16 '16 at 22:17
  • @WhiteFeather Does Excel 64 bit not have a 1048576 row limit as well? – nbayly Mar 16 '16 at 22:19
  • @DarwinvonCorax I don't think I specified, but yes, I am on a Windows 7 environment. Regards – nbayly Mar 16 '16 at 22:20
  • Yes, as stated here https://support.office.com/en-gb/article/Excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3, Excel 2010 has that limit, but you can split the file and start investigating. You can use, say, the first 1,000,000 rows and see which rows are not loading... For example, you can use the remove-duplicates function, loading 500,000 rows from both files, and see what remains... – White Feather Mar 16 '16 at 22:23
  • I don't know Table Inspector. Is it accurate? I mean does it really do a `select count(*) from table` or is the rowcount merely a statistical value to get an idea of how big the table approximately is? – PerlDuck Mar 16 '16 at 22:30
  • You can look at differences between files on Windows following what is in this answer: http://stackoverflow.com/questions/6877238/what-is-the-windows-equivalent-of-the-diff-command – White Feather Mar 16 '16 at 22:31
  • A colleague suggested uploading a smaller subset of the data to test. I will try this tomorrow and update with any information I can gather. At this point I wish to avoid using Excel, as it was giving me lots of issues opening any significantly sized files, which is the reason I turned to MySQL. Still, thanks for the suggestions, and I hope to hear some more. – nbayly Mar 16 '16 at 22:31
  • @PerlDuck Table Inspector is a built-in tool in Workbench, available from the contextual menu when clicking the table name. The Action Output reports it as `ANALYZE TABLE 'myschema'.'table1'` – nbayly Mar 16 '16 at 22:34
  • Hmm. I just read the [docs](http://dev.mysql.com/doc/refman/5.7/en/analyze-table.html) and am not sure. Can you execute `select count(*)`? I only trust that. – PerlDuck Mar 16 '16 at 22:43
  • @PerlDuck Ahh, just learning MySQL on the fly, so still learning the basic commands. Will try this tomorrow morning. Thank you for the suggestion. – nbayly Mar 17 '16 at 00:00
  • @PerlDuck Running that command returns the expected result of 1419910. The question seems kinda silly now. Still not sure why those stats are wrong in the Table Inspector. If you submit an answer with that note I will mark it as a correct response. Regards – nbayly Mar 17 '16 at 15:16
  • Glad I could help. This isn't a silly question because it's still not clear why this happened. But thank you for the offer :-) _[It's an offer I can't refuse](https://www.youtube.com/watch?v=SeldwfOwuL8)_. – PerlDuck Mar 17 '16 at 20:14
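
Following the diff idea from the comments, one way to get a second file to compare is to re-export the imported table with the same CSV settings. A minimal sketch, assuming the output path is hypothetical and the server's secure_file_priv setting permits writing there:

-- Re-export the freshly imported table using the same delimiters as the
-- original file, sorted by key so it can be diffed against a sorted source.
SELECT pkfield, field1, field2, field3
INTO OUTFILE 'c:/table1.out.csv'
FIELDS TERMINATED BY ',' ENCLOSED BY '"'
LINES TERMINATED BY '\r\n'
FROM myschema.table1
ORDER BY pkfield;

The exported file can then be compared line by line against a sorted copy of the original, for example with the Windows tools described in the linked answer.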

2 Answers


Honestly, I'm not sure myself why the number of rows isn't accurate after a fresh import of a table. I believe Table Inspector fetches the figure from a statistics table, and to my understanding those statistics only get updated when the table changes by more than 10 %. Perhaps this is the reason. However, the accurate number of rows can always be fetched with the traditional

select count(*) from myschema.table1;

As @nbayly confirmed, this gives the expected result of 1419910 rows, which matches the number LOAD DATA had reported.
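
To see the difference for yourself, here is a sketch contrasting the statistics-based figure with the exact one. For InnoDB, `TABLE_ROWS` in `information_schema` is only an estimate, and `ANALYZE TABLE` recalculates the stored statistics:

-- Approximate row count from the stored table statistics (what Table
-- Inspector appears to show); for InnoDB this is only an estimate.
SELECT TABLE_ROWS
FROM information_schema.TABLES
WHERE TABLE_SCHEMA = 'myschema' AND TABLE_NAME = 'table1';

-- Exact row count: a full scan, slower on large tables, but always accurate.
SELECT COUNT(*) FROM myschema.table1;

-- Recalculate the stored statistics so the estimate catches up.
ANALYZE TABLE myschema.table1;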

PerlDuck

Honestly, I don't know why this happened, but I do know a workaround. At first I thought it might be because of special characters present in the CSV records, but even after removing those I was still not receiving all the records. I also noticed that the file doesn't need to have thousands of records for LOAD DATA to miss rows; it happens even with a few hundred.

So for now, the only reliable solution I have found is to import via phpMyAdmin. Remove special characters and the like before importing, and also remove newlines from the field headers.

phpMyAdmin seems to have some sort of parser that tokenizes the CSV file and then creates SQL from those tokens. It does not use the LOAD DATA command at all, and hence it imports correctly.
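
To illustrate the difference (a rough sketch, not phpMyAdmin's actual output), the parsed rows effectively become plain INSERT statements instead of a single LOAD DATA call:

-- Hypothetical statements of the kind a CSV-to-SQL importer would generate
-- from the sample rows above.
INSERT INTO myschema.table1 (pkfield, field1, field2, field3)
VALUES ('6623950258', 'XYZ', 'A', 'B'),
       ('5377103432', 'XYZ', 'A', 'B');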

The downside is that it goes through the GUI, with all the slowness that comes with pointing and clicking.