
I run a small-to-medium car website, and we are trying to log how many times a visitor goes to a vehicle's detail page. We do this by hashing (MD5) the make, model, and ZIP code of the current vehicle. We then keep a vehicle_count total and increment it when the hashes match.

After running the numbers, there appear to be about 50 makes, each make has about 50 models, and our locations DB has about 44,000 unique ZIP codes: roughly 100 million+ potential unique hashes.

This is the CREATE TABLE statement:

CREATE TABLE `vehicle_detail_page` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `vehicle_hash` char(32) NOT NULL,
  `make` varchar(100) NOT NULL,
  `model` varchar(100) NOT NULL,
  `zip_code` char(7) DEFAULT NULL,
  `vehicle_count` int(6) unsigned DEFAULT '1',
  PRIMARY KEY (`id`),
  UNIQUE KEY `vehicle_hash` (`vehicle_hash`),
  KEY `make` (`make`),
  KEY `model` (`model`),
  KEY `zip_code` (`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This is the PHP code to insert/update the table:

public function insertUpdate($make, $model, $zip)
{
    // set table
    $table = self::TABLE;        
    // create hash
    $hash = md5($make.$model.$zip);

    // insert or update count
    try
    {
        $stmt = $this->db->conn->prepare("INSERT INTO $table
                                                (vehicle_hash, 
                                                    make, 
                                                    model, 
                                                    zip_code)
                                          VALUES
                                                (:vehicle_hash, 
                                                    :make, 
                                                    :model, 
                                                    :zip_code)
                                          ON DUPLICATE KEY UPDATE
                                                    vehicle_count = vehicle_count + 1;");
        $stmt->bindParam(':vehicle_hash', $hash, PDO::PARAM_STR);
        $stmt->bindParam(':make', $make, PDO::PARAM_STR);
        $stmt->bindParam(':model', $model, PDO::PARAM_STR);
        $stmt->bindParam(':zip_code', $zip, PDO::PARAM_STR);
        $stmt->execute();
    } catch (Exception $e)
    {
        return FALSE;
    }

    return TRUE;
}

Questions:

  1. Can MySQL handle this many rows?
  2. Does anyone see anything wrong with this code, and is there a better way to do this?
  3. What will querying this data be like?

The big question is: once this table grows, how will the PHP function above perform? If/when the table has a few million+ rows, how will it perform? Can anyone give some insight?

user1050544
  • with enough server power and enough patience, MySQL can handle anything... just not necessarily fast – Marshall Tigerus Apr 25 '14 at 20:20
  • `What will querying the data be like?` for the love of doge, do not forget the `WHERE` clause – MonkeyZeus Apr 25 '14 at 20:20
  • MySQL is perfectly capable of handling this, though why don't you normalize make and model properly? – Mark Baker Apr 25 '14 at 20:20
  • 1) RTFM: http://dev.mysql.com/doc/refman/5.0/en/table-size-limit.html 2) Why store hashes of the values? Why not just a normal table with (make,model,zip,visit_count) as normal fields? 3) Querying for what? Presumably `select ...`. – Marc B Apr 25 '14 at 20:21
  • Since the table is small (relatively few columns), can anyone guesstimate the size this table might consume? – user1050544 Apr 25 '14 at 20:21
  • Size should be 1 byte per character if you stick with standard ASCII and then add up the MySQL metadata – MonkeyZeus Apr 25 '14 at 20:23
  • What would a recommended hash be then? The WHERE clause would only include the indexed columns. Yeah, a common query would be: SELECT make, model, zip_code, vehicle_count FROM bestride_vdp HAVING vehicle_count > 10 ORDER BY vehicle_count DESC – user1050544 Apr 25 '14 at 20:24
  • Get yourself about 8 SSDs and RAID 1+0 those suckers – MonkeyZeus Apr 25 '14 at 20:24
  • @user1050544 Don't hash at all. Make a composite index of all the fields that go into the hash. – Barmar Apr 25 '14 at 20:24
  • Doesn't the unique index on the hash speed things up? What is a composite index? – user1050544 Apr 25 '14 at 20:27
  • http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html – Bacon Bits Apr 25 '14 at 20:27
  • As an alternative to a composite index, you can introduce a column that is “hashed” based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns. In MySQL, it is very easy to use this extra column: SELECT * FROM tbl_name WHERE hash_col=MD5(CONCAT(val1,val2)) AND col1=val1 AND col2=val2; Isn't this what I am doing, just in PHP code? It says it's faster? – user1050544 Apr 25 '14 at 20:31
  • Yes, the hash index will be faster, since it will be smaller. Make it even smaller by using a single-byte character set. Still, don't forget the `WHERE` clause. – Marcus Adams Apr 25 '14 at 20:37
  • Have you thought about using Google Analytics instead? It works fast - is an industry standard - is free and can do exactly what you want. – h2ooooooo Apr 25 '14 at 22:04

1 Answer


You could also avoid the hash altogether.

CREATE TABLE `vehicle_visits` (
  `make` varchar(100) NOT NULL,
  `model` varchar(100) NOT NULL,
  `zip_code` char(7) NOT NULL,
  `vehicle_count` int(11) unsigned NOT NULL DEFAULT '1',
  -- NOT NULL matters here: NULLs are never equal in a unique index,
  -- so nullable key columns would defeat the deduplication
  UNIQUE KEY `make_model_zip` (`make`,`model`,`zip_code`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This avoids maintaining multiple UNIQUE values. Instead of an "ID" and a "Hash", you use the real-world values themselves to create the UNIQUE identifier; notice how MySQL can use all 3 columns together to form one unique index.
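
The insert/update then needs no hash at all, because ON DUPLICATE KEY UPDATE fires on the composite unique key. A minimal sketch (the values are illustrative):

INSERT INTO vehicle_visits (make, model, zip_code, vehicle_count)
VALUES ('Honda', 'Civic', '02134', 1)
ON DUPLICATE KEY UPDATE vehicle_count = vehicle_count + 1;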

Note: to decrease the size of your index, you can decrease the size of the make and model columns, unless you are expecting 100-character make and model names, of course. If you are worried about size, you can also create the index using a prefix of each of the columns, as sketched below.
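
A sketch of both options (the lengths here are assumptions; prefix lengths on a unique key must still keep real make/model/ZIP combinations distinct, or distinct vehicles will collide on insert):

-- Option 1: shrink the columns themselves
ALTER TABLE vehicle_visits
  MODIFY make varchar(30) NOT NULL,
  MODIFY model varchar(30) NOT NULL;

-- Option 2: rebuild the unique key on column prefixes
ALTER TABLE vehicle_visits
  DROP KEY make_model_zip,
  ADD UNIQUE KEY make_model_zip (make(10), model(10), zip_code);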

Edit: adding the hash column as an index method

As an alternative to a composite index, you can introduce a column that is “hashed” based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns. http://dev.mysql.com/doc/refman/5.0/en/multiple-column-indexes.html
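
A sketch of that hash-column variant (the column and key names are illustrative; the application has to keep the hash in sync, e.g. by computing MD5(CONCAT(make, model, zip_code)) on every insert):

ALTER TABLE vehicle_visits
  ADD COLUMN vehicle_hash char(32) NOT NULL,
  ADD UNIQUE KEY vehicle_hash (vehicle_hash);

-- Lookup: the short indexed hash narrows the search,
-- and the real columns guard against hash collisions
SELECT make, model, zip_code, vehicle_count
FROM vehicle_visits
WHERE vehicle_hash = MD5(CONCAT('Honda', 'Civic', '02134'))
  AND make = 'Honda' AND model = 'Civic' AND zip_code = '02134';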

You will need to do some real-world tests to see which method is quicker. Since the data shows only about 50 makes and 50 models, the lookup will mostly involve the zip_code column, and index order also makes a difference. Also, creating an index using prefixes such as make(10), model(10), zip(7) creates an index of length 27 characters; an MD5 column would be 32.

The hash method may help with lookups, but will it really help with real-world applications? This table seems to track visitors, and will most likely have analytics performed on it. The composite index will help with SUM() operations (depending on the order of the index columns). For example, if I want the total number of visitors to the "Honda" or "Honda Civic" pages, that is easily done with the multiple-column index.
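
For instance (a sketch; both queries can use the composite index because make and model are its leftmost columns):

-- Total visits across all Honda pages
SELECT SUM(vehicle_count) FROM vehicle_visits WHERE make = 'Honda';

-- Total visits across all Honda Civic pages, any ZIP code
SELECT SUM(vehicle_count) FROM vehicle_visits
WHERE make = 'Honda' AND model = 'Civic';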

stomo21
  • I should think converting each of those three fields into foreign keys would save some space, too! – halfer Apr 25 '14 at 21:07
  • Do you have any concrete evidence of that being faster? This is taken directly from mysql.com: As an alternative to a composite index, you can introduce a column that is “hashed” based on information from other columns. If this column is short, reasonably unique, and indexed, it might be faster than a “wide” index on many columns. In MySQL, it is very easy to use this extra column: SELECT * FROM tbl_name WHERE hash_col=MD5(CONCAT(val1,val2)) AND col1=val1 AND col2=val2; Isn't this what I am doing, just in PHP code? It says it's faster? – user1050544 Apr 26 '14 at 17:38
  • @user1050544 Thanks, added your method to the answer. Sorry, no evidence, but I believe it will always require real-world testing to know for sure. – stomo21 Apr 26 '14 at 18:55