0

I am trying to move my file based organizing json files to mariadb. Approximately there are 2,000,000 json files where in my file based system are zipped. The total storage space for the zipped json files is 7GB.

When i inserted all the records to Mariadb the table storage became 35GB. i altered my table to be compress and the table size is 15GB. Is there a way to reduce even more the table size?

Is it normal for the storage to double when data is added to mariadb?

this is my table

CREATE TABLE `sbpi_json` (
  `fileid` int(11) NOT NULL,
  `json_data` longtext COLLATE utf8_bin NOT NULL,
  `idhash` char(32) COLLATE utf8_bin NOT NULL,
  `sbpi` int(15) NOT NULL,
  `district` int(2) NOT NULL,
  `index_val` int(2) NOT NULL,
  `updated` text COLLATE utf8_bin NOT NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=COMPRESSED;

ALTER TABLE `sbpi_json`
  ADD PRIMARY KEY (`fileid`),
  ADD UNIQUE KEY `idhash` (`idhash`),
  ADD KEY `sbpi` (`sbpi`);
Pradeep
  • 9,667
  • 13
  • 27
  • 34
  • What is the encoding of the json files in the file system? What is the updated field? 16GB - is that the space for the data only or does that include the various MySQL log files as well for innodb? – Shadow Sep 11 '19 at 15:45
  • json files are UTF-8. the updated field is used for on duplicate mysql. the characters for the updated field are less than 10. – Pantelis Livadiotis Sep 11 '19 at 15:50
  • 1
    *"Is it normal for the storage to double when data is added to mariadb?"* You can't compare on disk zipped "text" files with innoDB files, both solve different problems... Also you are comparing different compression algorithms... Also in InnoDB you add more columns plus the innoDB engine will add information on how to use/read the records exists and also add records headers. – Raymond Nijland Sep 11 '19 at 15:54
  • The main reason i am making this transfer, from file based to mariadb is due to performance. i used a php script to access the file and echo it to user but i notice there was a big delay (get zip, unzip, find the latest json file and echo it). alternative solution to speed up the process is to move to mariadb but the storage space increased significantly. Any suggestions to minimize the storage size are welcome – Pantelis Livadiotis Sep 11 '19 at 16:04
  • Why didn't you compress the file individually? Then your process would be: find, unzip, echo – Paul Spiegel Sep 11 '19 at 19:17
  • each zip contains a version of each file as also an additional json file with the charcteristict of each version ie date created,hash,index value - the larger the latest. by having a single zip to access and maintain its simplier than having multiple zips. – Pantelis Livadiotis Sep 12 '19 at 08:39
  • @PantelisLivadiotis - Doesn't that mean "unzip old version, add new file, then rezip"? – Rick James Sep 18 '19 at 23:56

2 Answers2

1

The JSON column in question is json_data, correct? It averages (uncompressed) about 10KB, correct? In the file implementation, there are multiple 'versions' of each, correct? If so, how do you tell which one you want to deliver to the user?

  • Most compression techniques give you 3:1; InnoDB compression gives you 2:1. This is partially because it has things that it can't (or won't) compress.
  • Compressing just the JSON column (in client code) and storing it in a MEDIUMBLOB will probably take less space in InnoDB than using COMPRESSED. (But this will not be a huge savings.)
  • Focus on how you pick which 'version' of the JSON do deliver to the user. Optimize the schema around that. Then decide on how to store the data.
  • Given that the table can efficiently say which file contains the desired JSON, then that will be the best approach. And use some normal, fast-to-uncompress technique; don't focus on maximal-compression.
  • If char(32) COLLATE utf8_bin is a hex string, use ascii, not utf8.
  • If it is hex, then UNHEX to further shrink it to only BINARY(16).
  • When a row is bigger than 8KB, some of the data (probably json_data) is stored "off-record". This implies an extra disk access and disk allocation is a bit more sloppy. Hence, storing that column as a file ends up taking about the same amount of time and space.
  • The OS probably allocates space in 4KB chunks. InnoDB uses 16KB blocks.
Rick James
  • 135,179
  • 13
  • 127
  • 222
  • Yes the json_data is the column that significantly increases the storage space. Yes average 10KB. Yes multiple version of each file. The latest is the one to deliver. the hash value is used to compare if there is a change in the file and everytime it changes the index_value increases, therefore the largest index_val is the latest version. I have tried compressing single column and blob datatype but the storage space does not change significantly. I was not expecting for the php zip algorithm to be so different than sql compress – Pantelis Livadiotis Sep 12 '19 at 08:46
-1

It's the text type that takes too much space. You can try to replace it with a smaller variant of text type if you can give for granted that that much lenght is ok. Also replacing char(32) with varchar(32) will help if those values are not always full lenght.

Or you can go with varchar even for the textual field, but keep eyes on what's on this answer before doing so.

Hope I helped!

Sterconium
  • 559
  • 4
  • 20
  • 3
    I did not downvote, but this answer is incorrect. All of the variable-length types, VARCHAR, VARBINARY, BLOB, TEXT, and all their cousins do not take more space depending on the type. They only store the length of the string value you put into them. – Bill Karwin Sep 11 '19 at 15:20
  • ι can not reduce to smaller datatype. i tried using text and not longtext and some json data could not be added. the largest detected so far is 2mb uncompress – Pantelis Livadiotis Sep 11 '19 at 15:21
  • Text fields' actual storage requirement is primarily dependant on the data you store in those fields. Just because there is a huge difference in the potential max size of data, it does not mean that there is a big difference if small bits of data are stored in those fields. idhash name indicates that the field will store a hash, which will have a constant length. – Shadow Sep 11 '19 at 15:25
  • 2MB needs `MEDIUMTEXT`. – Rick James Sep 11 '19 at 17:40