I am trying to optimise my MySQL table structure for a 3GB CSV file. So far, I have managed to import 60% of the 19m+ rows, with a MySQL table size of 5.5GB. How could I optimise my table structure to reduce the database table size? (as I am running out of disk space!)

A sample line from the CSV file is

"{0C7ADEF5-878D-4066-B785-0000003ED74A}","163000","2003-02-21 00:00","UB5 4PJ","T","N","F","106","","READING ROAD","NORTHOLT","NORTHOLT","EALING","GREATER LONDON","A"

...and my database structure is:

(
  `transaction_id` int(10) unsigned NOT NULL,
  `reference` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `price` int(10) unsigned NOT NULL,
  `sale_date` date NOT NULL,
  `postcode` varchar(8) COLLATE utf32_unicode_ci NOT NULL,
  `type` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
  `new_build` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
  `tenure` varchar(1) COLLATE utf32_unicode_ci NOT NULL,
  `property_number` varchar(10) COLLATE utf32_unicode_ci NOT NULL,
  `property_name` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `street` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `area` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `city` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `county1` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `county2` varchar(100) COLLATE utf32_unicode_ci NOT NULL,
  `unknown` varchar(1) COLLATE utf32_unicode_ci NOT NULL
)
user2761030
    You might ask MySQL itself regarding your data! Try `SELECT * FROM yourtable PROCEDURE ANALYSE;` and have a look at the result. – Bjoern Sep 13 '14 at 16:32
  • 2
    Why are you using `utf32` encoding? If this is just ASCII, use `utf8`. Here is a good explanation: http://stackoverflow.com/questions/496321/utf8-utf16-and-utf32. – Gordon Linoff Sep 13 '14 at 16:32

2 Answers


Let's look at the size of the fields.

Your database structure consists primarily of varchars, which under normal circumstances should take about one byte per character, plus one or two bytes for the length prefix. That is roughly the same as the quoted, comma-separated value occupies in the CSV file. Throw in a 10% fudge factor for storage overhead, and the table should come out about the same size as the CSV, or a little bigger.

Integers could go either way. They could be a single digit in the CSV file (two characters with the comma) or several digits; in MySQL an unsigned INT always occupies 4 bytes. Dates are probably smaller in MySQL (3 bytes for a DATE) than in the CSV file.

There is additional overhead for indexes, particularly if you have a fill factor that leaves free space on each data page for future inserts, and further overhead for page headers and row metadata. Even allowing for all of that, your table is much larger than would be expected.
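To see where the space is actually going, you can compare data size against index size directly. A sketch (the table name `yourtable` is a placeholder for your own):

    SELECT table_name,
           ROUND(data_length / 1024 / 1024)  AS data_mb,
           ROUND(index_length / 1024 / 1024) AS index_mb
    FROM information_schema.TABLES
    WHERE table_schema = DATABASE()
      AND table_name = 'yourtable';

If `index_mb` dominates, the problem is indexing; if `data_mb` does, it is the row format and character set.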

My guess is that your table is so much bigger because of utf32: it stores every character in four bytes, so each varchar column can take up to four times the space that utf8 would need for the same ASCII data. If you have no really good reason for utf32, switch to utf8.
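If the data is already loaded, the character set can be converted in place. Note that this rewrites the whole table, so it temporarily needs enough free disk for a copy (table name assumed):

    ALTER TABLE yourtable
      CONVERT TO CHARACTER SET utf8 COLLATE utf8_unicode_ci;

Given that you are already short on disk space, it may be easier to drop the table, fix the definition, and re-import.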

As a note: normally `varchar(1) NOT NULL` can be replaced by `char(1) NOT NULL`. This saves the encoding of the length, which is a big relative saving for such small fields. The same goes for other fields of known width: if you know the postcode is never longer than 8 characters, define it as `char(8)` rather than `varchar(8)`.
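Putting both suggestions together, a slimmer version of your structure might look like the sketch below (the table name is assumed, and the CHAR widths are suggestions based on your sample row, not requirements):

    CREATE TABLE transactions (
      `transaction_id`  int(10) unsigned NOT NULL,
      `reference`       varchar(100) NOT NULL,
      `price`           int(10) unsigned NOT NULL,
      `sale_date`       date NOT NULL,
      `postcode`        varchar(8) NOT NULL,
      `type`            char(1) NOT NULL,
      `new_build`       char(1) NOT NULL,
      `tenure`          char(1) NOT NULL,
      `property_number` varchar(10) NOT NULL,
      `property_name`   varchar(100) NOT NULL,
      `street`          varchar(100) NOT NULL,
      `area`            varchar(100) NOT NULL,
      `city`            varchar(100) NOT NULL,
      `county1`         varchar(100) NOT NULL,
      `county2`         varchar(100) NOT NULL,
      `unknown`         char(1) NOT NULL
    ) DEFAULT CHARSET=utf8 COLLATE=utf8_unicode_ci;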

Gordon Linoff

Two suggestions:

(1) Your fields

You might ask MySQL itself regarding your data! Try

SELECT * FROM yourtable PROCEDURE ANALYSE;

and have a look at the result.

(2) Your charset

You're using utf32, which spends four bytes on every character. If you don't really need it for other parts of your tables or applications, go for utf8 instead.

Bjoern