
I am creating a database table with over 1.3 billion rows.

The structure: ID, info1, info2, info3, city

Across the 1.3 billion rows, the same cities appear many times.
I was wondering which of these would be faster when looking up information with PDO (PHP):

  • Save the city name as plain text
  • Save a city ID instead, and look that ID up in another table that holds the plain city name
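Rough sketches of the two lookups (the table and column names here are only placeholders, not my actual schema):

```sql
-- Option 1: city stored as plain text in the big table
SELECT info1, info2, info3
FROM big_table
WHERE city = 'New York';

-- Option 2: only a city_id stored; the name lives in a separate city table
SELECT b.info1, b.info2, b.info3
FROM big_table AS b
JOIN city AS c ON c.city_id = b.city_id
WHERE c.name = 'New York';
```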

Which is faster, and why?

Thanks!

– wtm
1 Answer


You have the right idea: don't repeat the same city name over and over. Doing so will slow down your queries, in part because city names take up much more space in memory than integers do. Larger tables take more time to process, and at 1.3 billion rows that should make a significant difference. See this posting.

Repeating the same name thousands of times also makes it a lot more likely that you'll introduce variations (typos) that make your query results unreliable. Finally, this design will make it much tougher to make changes to a city name.

Replace the plain city column with a city_id that is unique for each city. In a separate city table, store that same city_id along with a name column that holds the full city name.

To get the benefit of this change, you must let MySQL know about the relationship: when you create the tables, make city_id the primary key of the city table, and in your big table declare city_id as a foreign key.
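
As a minimal sketch (the table name `big_table`, the column types, and the VARCHAR lengths are assumptions based on the structure in the question, not a definitive schema), that could look like this:

```sql
-- Lookup table: one row per distinct city
CREATE TABLE city (
    city_id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    name    VARCHAR(100) NOT NULL,
    PRIMARY KEY (city_id),
    UNIQUE KEY uq_city_name (name)  -- also lets you resolve a name to its city_id quickly
) ENGINE=InnoDB;

-- Big table: stores only the small integer city_id, never the city name
CREATE TABLE big_table (
    id      BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
    info1   VARCHAR(255),
    info2   VARCHAR(255),
    info3   VARCHAR(255),
    city_id INT UNSIGNED NOT NULL,
    PRIMARY KEY (id),
    KEY idx_city (city_id),  -- index used when filtering or joining by city
    CONSTRAINT fk_big_city FOREIGN KEY (city_id) REFERENCES city (city_id)
) ENGINE=InnoDB;
```

When you query, join the two tables on city_id (or look up the city_id once and filter the big table with that integer).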

This change will also allow you to expand your database more easily if you want to start tracking more info on a city. To learn more, look into database normalization.
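
For example (a sketch only; these extra columns are made up), tracking more per-city information later only touches the small city table, never the 1.3-billion-row table:

```sql
-- Hypothetical new per-city attributes: added to the small lookup table only
ALTER TABLE city
    ADD COLUMN country    VARCHAR(100) NULL,
    ADD COLUMN population INT UNSIGNED NULL;
```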

– BeetleJuice
  • "It will slow down your queries" --- this statement needs some explanation. If you just look up by an exact value like `city = 'New York'` what exactly would make it slower than `city_id = 42`? – zerkms Jul 17 '16 at 23:43
  • @zerkms An index based on integer values will be smaller than one based on strings, and generally smaller is faster, or at least puts less pressure on the index cache in your database. This becomes especially important when using multiple indexes in a `JOIN` type situation. Additionally, comparing `INT` vs `INT` is extremely quick, basically one CPU instruction, but `VARCHAR` vs. `VARCHAR` is a lot more expensive and depends on the length of the strings and how close they are in terms of characters. – tadman Jul 18 '16 at 00:37
  • @tadman I meant - it should have been put into the answer. Answers about performance optimisations must have some evidence. Just "it will slow down" is never enough. – zerkms Jul 18 '16 at 00:39
  • There's no "evidence" that's useful here. The person asking the question must do their own benchmarks with their data, their hardware, and their general workload. We can only guess, but in general terms `INT` performs better. – tadman Jul 18 '16 at 00:40
  • @tadman your comment is evidence enough. Explaining why it *might be* faster is still better than just "it will be faster, trust me". – zerkms Jul 18 '16 at 00:40
  • @tadman btw, for a B+Tree the number of comparisons for 1.3B rows is bounded above by 31 (the worst case). So I highly doubt that 30 varchar comparisons vs 30 integer comparisons will make any reasonable difference here, since it will likely be I/O that is slow. – zerkms Jul 18 '16 at 00:44
  • 1
    @zerkms The index on `VARCHAR` is usually limited to the first N characters, and if names frequently exceed this limit then you will easily get more than 31 comparisons. It's also a problem when for whatever reason the index cannot be used and a row-scan is required. I'm not sure if they've fixed this, but I've seen some seriously bizarre behaviour when using strings like UUIDs which are consistently over this index window. You can adjust it if necessary when creating your index, but it's still not going to get around the fact that a string index will be bigger. – tadman Jul 18 '16 at 00:45
  • @tadman and now you see how these comments are more useful than "It will slow down your queries" :-) – zerkms Jul 18 '16 at 00:50