
I need to give my website users the ability to select their country, province and city. So I want to display a list of countries, then a list of provinces in the selected country, then a list of cities in the selected province (I don't want any other UI solution for now). Of course, every name must be in the user's language, so I need additional tables for the translations.

Let's focus on the case of the cities. Here are the two tables:

CREATE TABLE `city` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `province_id` int(10) unsigned DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_fk_city_province` (`province_id`),
  CONSTRAINT `fk_city_province` FOREIGN KEY (`province_id`) REFERENCES `province` (`id`)
) ENGINE=InnoDB;

CREATE TABLE `city_translation` (
  `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `city_id` int(10) unsigned NOT NULL,
  `locale_id` int(10) unsigned DEFAULT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`id`),
  KEY `idx_fk_city_translation_city` (`city_id`),
  KEY `idx_fk_city_translation_locale` (`locale_id`),
  KEY `idx_city_translation_city_locale` (`city_id`,`locale_id`),
  CONSTRAINT `fk_city_translation_city` FOREIGN KEY (`city_id`) REFERENCES `city` (`id`),
  CONSTRAINT `fk_city_translation_locale` FOREIGN KEY (`locale_id`) REFERENCES `locale` (`id`)
) ENGINE=InnoDB;

The city table contains 4 million rows, and the city_translation table 4 million × the number of languages available on my website. That is 12 million now. If in the future I want to support 10 languages, it will be 40 million...

So I am wondering: is it a bad idea (performance-wise) to work with a table of this size, or is a good index (here on the join fields, city_id and locale_id) sufficient to make the size not matter?

If not, what are the common solutions to this specific (but, I guess, common) problem? I'm only interested in performance. I'm OK with denormalizing if necessary, or even with using other tools if they are more appropriate (ElasticSearch?).
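
To make the access pattern concrete, the per-request lookup I have in mind is roughly the following (the @ variables are just placeholders for the user's locale and the selected province):

SELECT c.id, ct.name
FROM   city AS c
JOIN   city_translation AS ct
       ON  ct.city_id   = c.id
       AND ct.locale_id = @user_locale
WHERE  c.province_id = @selected_province
ORDER BY ct.name;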

marcv
  • The accepted answer [here](http://stackoverflow.com/questions/173726/when-and-why-are-database-joins-expensive?rq=1) has some good info – Clive Jan 25 '16 at 18:06
  • Paris is "Paris" in several languages. You could implement a fall-back system: if city.name doesn't have a translation, use an alternate (see the sketch after these comments). – Neil McGuigan Jan 25 '16 at 23:13
  • "So I am wondering: is it a bad idea (performance wise) to work with a table of this size, or is a good index here on the join fields?" Why don't you try it, and let us know what the results are. – Gilbert Le Blanc Jan 26 '16 at 00:50

1 Answer


Get rid of id in city_translation. Instead, have PRIMARY KEY(city_id, locale_id). With InnoDB, this may roughly double lookup speed by cutting an unnecessary step out of the JOINs. You can also shrink the disk footprint by removing the two secondary indexes that start with city_id.
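
A sketch of what city_translation could look like with that change (note that locale_id has to become NOT NULL to take part in the primary key, and the locale foreign key still needs its own index):

CREATE TABLE `city_translation` (
  `city_id` int(10) unsigned NOT NULL,
  `locale_id` int(10) unsigned NOT NULL,
  `name` varchar(255) DEFAULT NULL,
  PRIMARY KEY (`city_id`, `locale_id`),
  KEY `idx_fk_city_translation_locale` (`locale_id`),
  CONSTRAINT `fk_city_translation_city` FOREIGN KEY (`city_id`) REFERENCES `city` (`id`),
  CONSTRAINT `fk_city_translation_locale` FOREIGN KEY (`locale_id`) REFERENCES `locale` (`id`)
) ENGINE=InnoDB;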

Do you think you will go beyond 16M cities? I doubt it. So save one byte by changing (in all tables) city_id to MEDIUMINT UNSIGNED.

Save 3 bytes by changing locale_id to TINYINT UNSIGNED.

Those savings are multiplied by the number of columns and indexes mentioning them.
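
For illustration, the column changes might look like this (the foreign keys may need to be dropped and re-created around these ALTERs so that the referencing and referenced types stay identical):

-- city.id and city_translation.city_id must end up with the same type;
-- the same applies to locale.id and locale_id (and to province_id, if you shrink it too).
ALTER TABLE city
  MODIFY id MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT;

ALTER TABLE city_translation
  MODIFY city_id   MEDIUMINT UNSIGNED NOT NULL,
  MODIFY locale_id TINYINT UNSIGNED NOT NULL;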

How big are the tables (GB)? What is the setting of innodb_buffer_pool_size? How much RAM is there? See if you can make that setting bigger than the total table size and yet no more than 70% of available memory. (That's the only "tunable" that is worth checking.)
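
One way to check those numbers (sizes from information_schema are approximate):

-- Approximate on-disk size (data + indexes), in GB
SELECT table_name,
       ROUND((data_length + index_length) / 1024 / 1024 / 1024, 2) AS size_gb
FROM   information_schema.TABLES
WHERE  table_schema = DATABASE()
  AND  table_name IN ('city', 'city_translation');

-- Current buffer pool size, in bytes
SHOW VARIABLES LIKE 'innodb_buffer_pool_size';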

I hope you have a default of CHARACTER SET utf8mb4 for the sake of Chinese users. (But that is another story.)
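
If the tables were created with an older default, the conversion would be along these lines (the collation shown is just one common choice):

-- Rebuilds the table; on tens of millions of rows this takes a while.
ALTER TABLE city_translation
  CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;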

Rick James
  • Thanks for the tips, @Rick. However, my question was about whether or not it is a bad idea to work with such big tables, or whether it is generally better to do it another way. What I need are elements to help me choose one design over another, criteria to weigh the pros and cons, etc. Or even something as basic as: "from x records up, forget it" ;-) I would love to read your advice on these points in your answer. – marcv Jan 26 '16 at 21:18
  • "For a billion records up, it gets exciting." No, I don't say "forget it". What I failed to say is "your 40M rows is fine". I simply launched into listing some tips on making it work better. What you have described (so far) is moderately bulky, but 'simple'. – Rick James Jan 26 '16 at 22:07