3

I have been writing the results from an algorithm that calculates distances between customers in a InnoDB table. For example if my customers were A, B, C and D, the table in the database looks like this, among other columns:

From | To    | Distance
  A     B        344
  A     C        274
  A     D        182
  B     C        338

And so on... It is a lot of rows I think I will hit 50 million.

The other columns are product_type and value. Those tell me how much the customer B (customer_to in the columns) buys of that product_type. That means that I have each pair multiple times depending on how many product_types the customer B buys.

I needed to to a query to group each customer with the products his neighbors buys and the value. The query looks like this:

select customer_from, product_type, avg(value) as opportunity
from customer_distances
where distance < 500
group by customer_from, product_type
order by opportunity desc; 

The innodb table could not answer me that query. Despite I changed the net_read_timeout to 28800, the mysql connection was lost during the query.

I tough it had something to do with innodb build for transactional processing and not for intensive queries. So I created a new table with MyIsam as engine and insert-select all the records from the innodb table.

As expected, the select was very fast (70 segs) and all other selects like count( distinct customer_from), where almost instantaneous.

Just for curiosity I tried to continue the process of inserting the distances in the myisam table. It was a surprise for me when the program started to run at least 100 times faster than when it was working on the innodb table -for INSERTS!

For each customer the program inserts something like 3000 rows (one for each neighbor for each product_type. Something like 300 neighbors and 10 product_types per customer). With the innodb table inserting a single customer took something between 40 and 60 seconds (aprox. 3000 rows). With the myisam table, it takes 1 second to insert 3 customers (9000 rows aprox).

Some extra information:

So in summary the question is: Why is MyISAM that fast with insert statements? What do you think?

EDIT 1: I'm adding the create statements for both tables, the innodb and myisam. EDIT 2: I deleted some unuseful information and formated a little bit here and there.

/* INNODB TABLE */
CREATE TABLE `customer_distances` (
  `customer_from` varchar(50) NOT NULL,
  `customer_from_type` varchar(50) DEFAULT NULL,
  `customer_from_segment` varchar(50) DEFAULT NULL,
  `customer_from_district` int(11) DEFAULT NULL,
  `customer_from_zone` int(11) DEFAULT NULL,
  `customer_from_longitud` decimal(15,6) DEFAULT NULL,
  `customer_from_latitud` decimal(15,6) DEFAULT NULL,
  `customer_to` varchar(50) NOT NULL,
  `customer_to_type` varchar(50) DEFAULT NULL,
  `customer_to_segment` varchar(50) DEFAULT NULL,
  `customer_to_district` int(11) DEFAULT NULL,
  `customer_to_zone` int(11) DEFAULT NULL,
  `customer_to_longitud` decimal(15,6) DEFAULT NULL,
  `customer_to_latitud` decimal(15,6) DEFAULT NULL,
  `distance` decimal(10,2) DEFAULT NULL,
  `product_business_line` varchar(50) DEFAULT NULL,
  `product_type` varchar(50) NOT NULL,
  `customer_from_liters` decimal(10,2) DEFAULT NULL,
  `customer_from_dollars` decimal(10,2) DEFAULT NULL,
  `customer_from_units` decimal(10,2) DEFAULT NULL,
  `customer_to_liters` decimal(10,2) DEFAULT NULL,
  `customer_to_dollars` decimal(10,2) DEFAULT NULL,
  `customer_to_units` decimal(10,2) DEFAULT NULL,
  `liters_opportunity` decimal(10,2) DEFAULT NULL,
  `dollars_opportunity` decimal(10,2) DEFAULT NULL,
  `units_oportunity` decimal(10,2) DEFAULT NULL,
  PRIMARY KEY (`cliente_desde`,`cliente_hasta`,`grupo`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

/* MYISAM TABLE */
CREATE TABLE `customer_distances` (
  `customer_from` varchar(50) NOT NULL,
  `customer_from_type` varchar(50) DEFAULT NULL,
  `customer_from_segment` varchar(50) DEFAULT NULL,
  `customer_from_district` int(11) DEFAULT NULL,
  `customer_from_zone` int(11) DEFAULT NULL,
  `customer_from_longitud` decimal(15,6) DEFAULT NULL,
  `customer_from_latitud` decimal(15,6) DEFAULT NULL,
  `customer_to` varchar(50) NOT NULL,
  `customer_to_type` varchar(50) DEFAULT NULL,
  `customer_to_segment` varchar(50) DEFAULT NULL,
  `customer_to_district` int(11) DEFAULT NULL,
  `customer_to_zone` int(11) DEFAULT NULL,
  `customer_to_longitud` decimal(15,6) DEFAULT NULL,
  `customer_to_latitud` decimal(15,6) DEFAULT NULL,
  `distance` decimal(10,2) DEFAULT NULL,
  `product_business_line` varchar(50) DEFAULT NULL,
  `product_type` varchar(50) NOT NULL,
  `customer_from_liters` decimal(10,2) DEFAULT NULL,
  `customer_from_dollars` decimal(10,2) DEFAULT NULL,
  `customer_from_units` decimal(10,2) DEFAULT NULL,
  `customer_to_liters` decimal(10,2) DEFAULT NULL,
  `customer_to_dollars` decimal(10,2) DEFAULT NULL,
  `customer_to_units` decimal(10,2) DEFAULT NULL,
  `liters_opportunity` decimal(10,2) DEFAULT NULL,
  `dollars_opportunity` decimal(10,2) DEFAULT NULL,
  `units_oportunity` decimal(10,2) DEFAULT NULL,
  PRIMARY KEY (`cliente_desde`,`cliente_hasta`,`grupo`)
) ENGINE=MyISAM DEFAULT CHARSET=utf8;
Community
  • 1
  • 1
Dago Borda
  • 35
  • 1
  • 6
  • Were the tables otherwise identical (indexes, etc...)? And there are separate server settings (such as memory cache sizes) for the engines that could affect performance. – Uueerdo Jul 14 '16 at 22:47
  • You should normalise the product sales data. The table doesn't make much sense as presently designed. – user207421 Jul 14 '16 at 22:52
  • Identical tables, same servers. – Dago Borda Jul 14 '16 at 23:22
  • I need the data like that for reporting purposes. I need it to be faster and if I normalize the data, then the database would need to to joins in order to build the results. – Dago Borda Jul 14 '16 at 23:23
  • 1
    Questions about performance normally need a CREATE for all relevant tables, and an EXPLAIN – Strawberry Jul 14 '16 at 23:30
  • 1
    You won't convince me that denormalizing the data like this and doubling or tripling the size of a table with 50 million rows is really going to make your reports run significantly faster. It is equally probably that they will run slower. And how are you going to maintain this data? – user207421 Jul 15 '16 at 00:09
  • In the transactional database the information is normalized and the data is changing as the business is running. This table was created specifically for that report. That is a tableau viz with a map where you will be able to see customers that are close to each others and their transactions by product type. Queries will be like `select * from customer_distances where customer_from = '238483'`. I may be wrong, I don't know for sure, but I think is faster for the database to just return all the registers with the condition than joining customers with the distances and the sales. – Dago Borda Jul 15 '16 at 04:11
  • I will add the Create sentences for both tables Strawberry. I don't know what the EXPLAIN is. – Dago Borda Jul 15 '16 at 04:13
  • "Same servers" does not matter....well it does, but it is not what I was talking about. MySQL has separate settings, on a single server instance, for the two engines (as touched on in the second answer posted). – Uueerdo Jul 15 '16 at 20:43
  • You are right I didn't understand your comment. There should be for sure different settings between engines. I haven't worked on that project for some time, but as soon as I have an answer I'll post the results in here. – Dago Borda Aug 01 '16 at 15:36

1 Answers1

4

Inserts

  • InnoDB, by default, "commits" each INSERT immediately. This can be remedied by clumping 100-1000 rows at a time.
  • Batching inserts will speed up both MyISAM and InnoDB - perhaps by 10x.
  • Learn about autocommit and BEGIN..COMMIT.

Select

  • InnoDB consumes more disk space than MyISAM -- typically 2x-3x; this impacts table scans, which you are probably
  • For that query, a composite index on (customer_from, product_type, distance) would probably help both engines.

Tuning

  • When running just MyISAM, set key_buffer_size to 20% of RAM and innodb_buffer_pool_size=0.
  • When running just InnoDB, set key_buffer_size to only 10M and innodb_buffer_pool_size to 70% of RAM.

Normalization and saving space

  • Smaller --> more cacheable --> less I/O --> faster (in either engine)
  • DECIMAL(10,2) is not the best in most cases. Consider FLOAT for non-money (such as distance). Consider fewer digits; that handles up to 99,999,999.99, and takes 5 bytes.
  • It is usually not a good idea to have replicated columns, such as the 10 columns of customer_from and customer_to. Have a Customers table, with both in it.
  • Each of your latitud and longitud are 7 bytes and have unnecessary resolution. Suggest latidud DECIMAL(6,4) and longitud (7,4), for a total of 7 bytes. (These give 16m/52ft resolution.)

Result

After those suggestions, the 50M-row table will be very much smaller, and run very much faster in both engines. Then run the comparison again.

Rick James
  • 135,179
  • 13
  • 127
  • 222