Alternative to many-to-many (junction) table to reduce size

Question

My project https://simbiat.ru/fftracker has a table, that is a junction table containing IDs of achievements earned by each character ID. It works fine, but my concerns is its size: before optimization it was 6.1 GBs and after it - 4 GBs. And it's not the size of the index

Data    3.1 GiB
Index   901.5   MiB
Overhead    2.5 MiB
Effective   4.0 GiB
Total   4.0 GiB

I do not see any performance impact of it now, but I am concerned about the size, especially since the table is updated frequently and grows larger and larger. Even though it has just 3 columns (definition generated by phpMyAdmin):

CREATE TABLE `ff__character_achievement` (
  `characterid` int(10) UNSIGNED NOT NULL COMMENT 'Character ID taken from Lodestone URL (https://eu.finalfantasyxiv.com/lodestone/character/characterid/)',
  `achievementid` smallint(5) UNSIGNED NOT NULL COMMENT 'Achievement ID taken from Lodestone (https://eu.finalfantasyxiv.com/lodestone/character/characterid/achievement/detail/achievementid/)',
  `time` date NOT NULL COMMENT 'Date when achievement was received according to Lodestone'
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4 COLLATE=utf8mb4_unicode_ci ROW_FORMAT=COMPRESSED;


ALTER TABLE `ff__character_achievement`
  ADD PRIMARY KEY (`characterid`,`achievementid`) USING BTREE,
  ADD KEY `ach` (`achievementid`) USING BTREE;


ALTER TABLE `ff__character_achievement`
  ADD CONSTRAINT `char_ach_ach` FOREIGN KEY (`achievementid`) REFERENCES `ff__achievement` (`achievementid`) ON DELETE CASCADE ON UPDATE CASCADE,
  ADD CONSTRAINT `char_ach_char` FOREIGN KEY (`characterid`) REFERENCES `ff__character` (`characterid`) ON DELETE CASCADE ON UPDATE CASCADE;
COMMIT;

Considering that original tables are 408KBs and 707MBs and fill with lots of other data (varchars and text even), I would expect the junction table to be 2GBs with indexes at most. Yes, I can remove the date column, but it does not help much: 2.8 instead of 3.1GBs. Which seems weird to me, but that's the values reported by MariaDB.
The purpose of the table is to dynamically determine the rarest achievements as seen on https://simbiat.ru/fftracker/statistics/achievements/ and also generate random list of characters that have the achievements as seen on https://simbiat.ru/fftracker/achievement/1/ (as example). I think that for this purpose 4GBs for a table is too excessive.
So my question is, whether there may be some kind of way to store the same data and easily utilize it for the same purpose, but have smaller footprint?
From what I know it may be possible to store achievement IDs for each character as a JSON string, but I will definitely lose CONSTRAINT (which I prefer to retain) and I do not see an easy to use way to search through respective JSONs, unless doing a WHERE or FULLTEXT search, which does not seem to be appropriate for this.
Any suggestion will be appreciated.

I question whether the cascading delete and update are critical. FKs do cost something during inserts (even without the cascading). — Rick James, Mar 16 '21 at 18:42

score 1 · Answer 1 · answered Mar 16 '21 at 15:41

1

Keep in mind storing the data as JSON will probably make it larger, not smaller.

I ran some tests by converting a data dump from Stack Overflow into JSON data, and found it caused the size of the data to increase 2x - 3x. Integers are worse than strings in this way.

See my presentation How to Use JSON in MySQL Wrong, or the video. This applies to MariaDB as well.

The size of a row should be based on the size of the data types. Four bytes for INT, two bytes for SMALLINT, three bytes for DATE. There's a bit of overhead as you know. Let's call it 16 bytes per row. So for it to occupy 4GiB, you must have on the order of 270 million rows, right?

That's a fair bit of data, but it's not a deal-breaker. At my company, we have many tables that have over a billion rows, with much larger content than a couple of integers. We start to get a bit worried if the table is over 500GiB. At that point, operations like backups or schema changes become too difficult.

There might be ideas to shrink these rows like using a SMALLINT instead of an INT for the character id. But these ideas inevitably make your use of the data more complicated.

You have to ask yourself if it's going to make a difference. Even if you cut the size of the data types in half, pretty soon you'll just accumulate twice as many rows, and then you'll be back where you started.

Ultimately, you just have to decide if this database is worth investing into adequate server capacity, in storage and RAM.

answered Mar 16 '21 at 15:41

Bill Karwin

538,548
86
673
828

the data types were selected based on the conventions used on Lodestone and not to reduce space. They were the most suitable. I understand that 4GBs is not that much, it's just that I feel like it's too much for my purposes and perhaps i am missing some kind of an approach, that would allow me to reduce the the footprint, if possible. – Simbiat Mar 16 '21 at 17:43
Was my estimate of the row count accurate? I'm not sure if you have 270 million rows (more or less) what makes you think it should take less space? What is the average row size reported by `SHOW TABLE STATUS LIKE 'ff__character_achievement'\G`? – Bill Karwin Mar 16 '21 at 17:50
There are 142368136 rows. Average row length is 22. I understand why it takes this much right now and do not think this is wrong. I am thinking, that there may be an alternative to junction table, that would allow to store this data more efficiently. If there is none - then... Well, there is none. – Simbiat Mar 16 '21 at 18:02
1

I don't think there is a better option. A junction table is the right way to do this. Storing integers as a comma-separated string generally makes them take more than double the space. Also, [comma-separated strings should be avoided](https://stackoverflow.com/q/3653462/20860). – Bill Karwin Mar 16 '21 at 18:27
Thanks for the comment. I will wait some time, in case, someone has a suitable idea and then mark your answer accordingly. – Simbiat Mar 16 '21 at 18:38

Rick James · Accepted Answer · 2021-03-19T15:00:32.240

Analysis of Size

The 3 columns take this much for the data BTree and PRIMARY KEY:

4 bytes for INT
2 bytes for SMALLINT
3 bytes for DATE
Total, including overhead, is about 40 bytes.

Since the lone secondary index has two of those 3 columns, I would not expect it to be much smaller, so I am surprised by the ratio of 3.1 : 0.9 GiB

3.1GiB/40B ~= 80M -- Is that about how many rows in that table? So, my 40B guestimate is a bit high, but Bill's was a bit low. Still we are all "in the same ballpark" for this kind of computation. Caveat: The 142M is actually an approximation; the real value could easily be Bill's 270M or my 80. The real count is available only from COUNT(*) unless you are running a version of MariaDB that gives the exact number.

The FKs may be forcing an extra index on the reference tables, but do not impact this table (except for slowing down INSERTs).

Possible workaround

How many different "achievements" are there? If less than 64, we can discuss the use of SET or BIGINT UNSIGNED. Such a column could be added to the player table and consume only 8 (or fewer) bytes per player. Since I don't know what you are doing with the data, I can't be more specific. See also 1 << n, &, BIT_COUNT().

Please show us the SELECT that needs optimizing.

Another approach

Invent 3000 distinct 3-letter words. Have a TEXT column for each of the 142M rows; it would contain the relevant 'words'. Add FULLTEXT(col) to build a fulltext index of those "words". Now searching 142M rows for one or a small number of "id" "words" would be very efficient.

The space for the column would be 4 bytes per id per row (plus some overhead).

There are 142368136 rows. I am not concerned that it takes so much in this "approach". I am thinking, that perhaps there is an alternative way to store the same data for my purpose. Achievements table has a little bit shy from 3000 IDs, so BIGINT will not help here (and it would still require a WHERE LIKE to look up values, from what I understand. — Simbiat, Mar 17 '21 at 04:26
@Simbiat - More likely bit boolean operations and a table scan. — Rick James, Mar 17 '21 at 05:35

Alternative to many-to-many (junction) table to reduce size

2 Answers2