
I have two tables: one is the master table and the other is just a cache. From time to time I check whether the cache table is up to date and no data is missing. The cache table uses the MyISAM engine and the master table uses InnoDB.

To explain it in more detail, here is an example.

The cache table contains fields from the following two tables:

product_categories (cat_id, cat_name, parent_cat_id DEFAULT NULL, parent_cat_name DEFAULT NULL)

products (product_num, product_name, product_desc, price, image, product_date, availability)

It may be that the cache table is missing products entirely, or that it contains product data that is no longer accurate.

In the question Compare two MySQL databases, the tool Toad for MySQL is mentioned, but I want to do this using PHP.

Cache table schema

products_cache | CREATE TABLE `products_cache` (
  `product_num` int(10) unsigned NOT NULL AUTO_INCREMENT,
  `cat_id` int(10) unsigned NOT NULL,
  `parent_cat_id` int(10) unsigned DEFAULT NULL,
  `cat_name` varchar(50) NOT NULL,
  `parent_cat_name` varchar(50) DEFAULT NULL,
  `product_desc` text NOT NULL,
  `price` float(10,2) unsigned NOT NULL,
  `image` varchar(65) NOT NULL DEFAULT '',
  `product_date` date DEFAULT NULL,
  `availability` tinyint(1) NOT NULL DEFAULT '1',
  PRIMARY KEY (`product_num`)
) ENGINE=MyISAM

Possible solution

Compute the MD5 of the fields and store it in the cache table; on the next run, recompute the hash from the master data and compare it with the stored value to detect changed rows. It should work fine, except for the performance cost (I run the cache fixer every month, so I think I can live with that). Please comment on this approach.
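For illustration, here is a minimal PHP/PDO sketch of what I have in mind. The connection details and the row_md5 column on products_cache are assumptions for the example, and for brevity it hashes only the products fields (the category columns could be folded in the same way):

<?php
// Sketch of the proposed MD5 check. Assumptions: PDO, invented connection
// details, and a hypothetical `row_md5` CHAR(32) column on products_cache.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$lookup = $pdo->prepare('SELECT row_md5 FROM products_cache WHERE product_num = ?');
$stale  = [];

$products = $pdo->query(
    'SELECT product_num, product_name, product_desc, price,
            image, product_date, availability
     FROM products'
);
foreach ($products as $row) {
    // Hash the master fields in a fixed order; the '|' separator keeps
    // adjacent fields from running together into identical strings.
    $md5 = md5(implode('|', [
        $row['product_name'], $row['product_desc'], $row['price'],
        $row['image'], $row['product_date'], $row['availability'],
    ]));

    $lookup->execute([$row['product_num']]);
    $cached = $lookup->fetchColumn();

    // A missing cache row or a mismatched hash marks the row as stale.
    if ($cached === false || $cached !== $md5) {
        $stale[] = $row['product_num'];
    }
}
// $stale now holds the product_nums whose cache rows need rebuilding.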

  • Could you please post the output of `SHOW CREATE TABLE cache_table`? – Quassnoi Nov 10 '10 at 13:54
  • I don't understand your motivation for storing the same data twice - why do you do this? – Jon Black Nov 10 '10 at 14:00
  • @f00 To achieve better performance – Maximus Nov 10 '10 at 14:56
  • any proof that a simple join doesn't perform as required? That's pretty hard to imagine. – sfussenegger Nov 10 '10 at 15:13
  • In the question I said it's a sample scenario. In reality we have 4-table joins, which makes queries slow and requires more table scanning. Consider the categories table: a product can be in a parent category or a child category. If it is in a child category we need two joins or two queries. The site is quite busy, so I am now building a cache table to reduce load on the main tables. Also, the main table updates frequently (i.e. ratings, views, etc.) with data I do not need in the cache table, so this may improve performance greatly. – Maximus Nov 10 '10 at 15:22

1 Answer


Instead of computing MD5 sums for all of your data every month, you could simply record changes in a separate table using triggers.

CREATE TABLE changes (
  `table` char(30) NOT NULL, -- backticks needed: `table` is a reserved word; TODO use an enum for better performance
  id int NOT NULL,
  UNIQUE KEY tableId (`table`, id)
) ENGINE=InnoDB -- InnoDB, so the batch job below can clear processed rows in a transaction

CREATE TRIGGER insert_products AFTER INSERT ON products FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('products', NEW.product_num);
CREATE TRIGGER update_products AFTER UPDATE ON products FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('products', NEW.product_num);
CREATE TRIGGER delete_products AFTER DELETE ON products FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('products', OLD.product_num);

CREATE TRIGGER insert_product_categories AFTER INSERT ON product_categories FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('product_categories', NEW.cat_id);
CREATE TRIGGER update_product_categories AFTER UPDATE ON product_categories FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('product_categories', NEW.cat_id);
CREATE TRIGGER delete_product_categories AFTER DELETE ON product_categories FOR EACH ROW INSERT IGNORE INTO changes (`table`, id) VALUES ('product_categories', OLD.cat_id);

-- do this for every involved table

Once in a while, you could then update the changed rows in a nightly batch job (pseudo code; a runnable PHP sketch follows):

for {table, id} in query(select `table`, id from changes) {
  cacheRow = buildCacheRow($table, $id)
  doInTransaction {
    query(replace into products_cache values $cacheRow)
    query(delete from changes where `table` = $table and id = $id)
  }
}
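
The same loop as a runnable PHP/PDO sketch - the connection details and buildCacheRow() are assumptions, not working code from a real project:

<?php
// PHP/PDO version of the batch job above. Assumptions: invented connection
// details, and buildCacheRow() is a hypothetical helper that re-reads the
// master tables and returns one denormalized cache row as an ordered array.
$pdo = new PDO('mysql:host=localhost;dbname=shop', 'user', 'pass');

$replace = $pdo->prepare(
    'REPLACE INTO products_cache
     (product_num, cat_id, parent_cat_id, cat_name, parent_cat_name,
      product_desc, price, image, product_date, availability)
     VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?)'
);
$cleanup = $pdo->prepare('DELETE FROM changes WHERE `table` = ? AND id = ?');

$changes = $pdo->query('SELECT `table`, id FROM changes')->fetchAll(PDO::FETCH_ASSOC);
foreach ($changes as $change) {
    $cacheRow = buildCacheRow($pdo, $change['table'], $change['id']);

    // Rebuild the cache row and clear its change marker together.
    $pdo->beginTransaction();
    $replace->execute($cacheRow);
    $cleanup->execute([$change['table'], $change['id']]);
    $pdo->commit();
}

Note that products_cache is MyISAM, which ignores transactions, so the transaction here only really protects the DELETE from the InnoDB changes table; if full atomicity matters, the cache table would need to be InnoDB as well.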
sfussenegger
  • This is great, but it won't work in my case because the products table updates very frequently, i.e. views, ratings, etc. – Maximus Nov 10 '10 at 17:02
  • @jason why do you think that's a problem? – sfussenegger Nov 10 '10 at 17:12
  • @sfussenegger Because it is going to record every update, which I don't need - for example views, ratings and other information that I don't need in the cache table. At the end of the month I would have all the product ids in the changes table. I want to record only changes that are approved by staff - or rather, I want to ensure that the cache table contains accurate data. – Maximus Nov 10 '10 at 17:19
  • @jason Two possibilities: a) check whether one of the columns you're interested in changed its value, or b) split your table in two - one for data that's more or less static and another for data that changes frequently. This way you might already fix the performance issues you have right now (faster reads due to higher cache hit rates, and probably faster writes as well). – sfussenegger Nov 10 '10 at 17:28
  • @jason If I wasn't clear enough: I'd strongly recommend splitting your table before you start denormalizing your data. If the problem persists, you can still evaluate throwing in some caches. – sfussenegger Nov 10 '10 at 17:30
  • Thanks sfussenegger, I will definitely consider your recommendation :) – Maximus Nov 10 '10 at 17:35