1
  • I have about 800 000 clients (and this number is growing) sending POST or GET requests to a PHP page, each with at most 10 parameters.
  • I need to save them in a MySQL DB for a live statistics page.
  • What is the best way to do this without losing any data?

I have this code idea, but I think it is too slow:

<?php

$mysql = \mysql::_init();

$data = ( \request::is_post ) ? \request::get_post_data : \request::get_get_data;

clean_and_validate( $data );

$mysql->query("
    INSERT INTO `userStat`
    ( `userToken`, `userPosition`, `app`, `build` ) 
    VALUES ( '" . implode( "', '", $data ) .  "' )
");

function clean_and_validate( &$data ) {
    //validate params here
}

Or this way (saving just the raw data):

    <?php

    $mysql = \mysql::_init();

    $rawData = ( \request::is_post ) ? \request::get_raw_post_data : \request::get_raw_get_data;

    $mysql->query("
        INSERT INTO `userStat`
        ( `data` )
        VALUES ( '" . $mysql->real_escape_string( json_encode( $rawData ) ) . "' )
    ");
jim M
  • a couple of notes (not answering your question, but comments on your code): [you don't need to pass by reference in php](http://stackoverflow.com/questions/2157799/when-to-pass-by-reference-in-php). also, your code is open to sql injection. you should be using [prepared statements](http://stackoverflow.com/questions/1290975/how-to-create-a-secure-mysql-prepared-statement-in-php). – devlin carnate Nov 09 '15 at 19:04
  • The clean_and_validate function does the work of preparing all of the data before it is inserted. – jim M Nov 09 '15 at 19:07
  • The `mysql` functions are deprecated, so I would recommend you to use MySQLi or PDO anyway. It's also easy to make mistakes in the validation. The best way to be sure there can be no SQL injection is to use data binding. – Anders Nov 09 '15 at 19:13
  • I have a few thoughts: First, I think your insert statement would be fast enough, depending on your indexes and the ultimate size of the table. Where you might run into issues is querying the stats table to get your updated statistics (per day?). Since you are talking about around 300 million records per year, you may want to consider partitioning your table by day (assuming you will be querying it by day). Also, this is a lot for MySQL; you may want to consider some other database technology (MongoDB can scale and is free). – Joel Nov 09 '15 at 19:14
  • Using data binding (so the query is always the same) might help MySQL cache the execution plan. Also, a stored procedure might help. But I am only guessing here. – Anders Nov 09 '15 at 19:19
  • This kind of problem can be solved by brute force. First move the statistics data to a separate MySQL on a different server (you can have more than one MySQL connection), then dimension and tune the "statistics" server properly. Frankly, I'd consider using MongoDB or another NoSQL system. – LSerni Nov 09 '15 at 19:28
  • Whatever you decide to do, just **don't do** what @lserni suggested; it's basically telling you to drink the sea to quench your thirst. It's a dumb move - you have several viable options, from creating a queueing mechanism to flush multiple inserts at once, to running the inserts as they are with a proper transactional table engine (InnoDB, TokuDB). – N.B. Nov 09 '15 at 21:44
  • @N.B., I totally agree on the "dumb"! There are several efficient *and not simple* - nor simplistic - ways of tackling this kind of problem. Starting from *defining it*: how can anyone plan for one DB or another without - for example - *even knowing what statistics are asked for*? (What if, absurdly, we discovered that those 800K rows could be replaced by a Redis counter updated every so often?) Without domain intelligence, I can only suggest the dumb solution: *get a bigger hammer*. Doing otherwise would be worse than dumb: it would be *foolish*. Also... *if it's dumb and it works...* – LSerni Nov 09 '15 at 22:19
  • @lserni - the fact one **needs** to have no data loss rules your big hammer out. Also, the fact the OP already uses MySQL / is more proficient with MySQL implies that another tool that requires time to learn is not helpful. The *dumb* part I mentioned reflected on the suggestion, not you, I apologize if I've offended you. It's just not feasible, in my opinion, to just waltz in and change the hammer - it's not that easy :) – N.B. Nov 09 '15 at 22:25
  • Don't worry, I'm not offended. But note that the boldfacing on 'needs' is yours. I wouldn't stake large sums on the OP having done an evaluation of data integrity. Anyone will tell you he wants 0% risk of data loss (who would say otherwise?). The acid test is, how much is one willing to pay (in backups, redundancy etc.) to *ensure* that 0%. Here, this has not yet been done. As for the time required to learn -- it's a cost; but would doing things in a sub-optimal way be any less of a cost? [the actual answer being, *we cannot know yet*]. – LSerni Nov 09 '15 at 22:36

2 Answers

1

It's important to realize that there are only so many options you have with your code - this issue is mostly related to utilising the hardware you have optimally.

The problem: insert 800 000 records across 24 hours with no record loss.


Potential issue: you haven't mentioned what the peak insert rate would be (for example, all 800k records could arrive within the same hour), or whether you would have an even spread of ~10 inserts per second (quite low, really).

The real problem behind this is utilising the permanent storage device, i.e. the hard drive. You said you need no record loss - this is only possible with a transactional engine, and it rules NoSQL out, so please don't consider it at all. At the moment you have InnoDB and TokuDB as transactional engines, with InnoDB being the default engine.
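
For what it's worth, a minimal sketch of such a table pinned to InnoDB might look like this (the column types, the surrogate key and the timestamp column are assumptions based on the column names in the question; the DSN and credentials are placeholders):

    <?php

    // Minimal sketch: keep the stats table on a transactional engine (InnoDB).
    // Column types, the surrogate key and the timestamp column are assumptions
    // based on the column names in the question; adjust them to the real data.
    $pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass');

    $pdo->exec("
        CREATE TABLE IF NOT EXISTS `userStat` (
            `id`           BIGINT UNSIGNED NOT NULL AUTO_INCREMENT,
            `userToken`    VARCHAR(64) NOT NULL,
            `userPosition` VARCHAR(64) NOT NULL,
            `app`          VARCHAR(32) NOT NULL,
            `build`        VARCHAR(32) NOT NULL,
            `createdAt`    TIMESTAMP   NOT NULL DEFAULT CURRENT_TIMESTAMP,
            PRIMARY KEY (`id`)
        ) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4
    ");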

If you have a mechanical hard drive, you need to measure how many IOPS it's capable of (Input Output Operations per Second). This number varies from drive to drive. It tells you how many times per second it can read or write.

For an SSD this number is much higher; some drives are capable of 100 000 IOPS.

This unit of measurement is important because it tells you how many times a transactional engine will be able to force the disk to store the data. Non-transactional engines don't force the disk to store the data, so the OS is capable of scheduling the writes - this is why NoSQL stores are famous for losing data: they tell you it's written, but it's in a buffer which can get lost before the drive commits the data.

Now, you have to determine whether you're fine with the IOPS you have available or not. One query = 1 I/O. This tells you how many inserts per second you are capable of doing. Queueing inserts and then sending them to the DB to write down in 1 I/O is an optimization technique; people use it when they want to trade the bandwidth of the disk for the I/O. Basically, it means grouping several inserts (100 or so) and sending them in a BEGIN TRANSACTION / COMMIT block, which lets you write down a lot of data in a single I/O. It's a bit tricky to implement this queueing mechanism because you have to code a small long-running program which is capable of queueing the requests / querying the DB.
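
A rough sketch of that queue-and-flush idea, assuming PDO, the four-column `userStat` table from the question, and a `$queue` array of buffered rows collected by some long-running worker (the DSN and credentials are placeholders):

    <?php

    // Rough sketch of the queue-and-flush idea: write a whole batch of buffered
    // rows inside one transaction so they are persisted with a single commit.
    // $queue is assumed to be an array of associative rows collected elsewhere;
    // the DSN and credentials are placeholders.
    $pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);

    $stmt = $pdo->prepare("
        INSERT INTO `userStat` (`userToken`, `userPosition`, `app`, `build`)
        VALUES (:userToken, :userPosition, :app, :build)
    ");

    $pdo->beginTransaction();
    try {
        foreach ($queue as $row) {
            $stmt->execute([
                ':userToken'    => $row['userToken'],
                ':userPosition' => $row['userPosition'],
                ':app'          => $row['app'],
                ':build'        => $row['build'],
            ]);
        }
        $pdo->commit();   // one commit persists the whole batch
    } catch (Exception $e) {
        $pdo->rollBack(); // the batch stays in the queue and can be retried
        throw $e;
    }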

After you've determined what you have available in terms of processing power, you can start doing things right - like the nice guys in the comments mentioned, do it properly.

  • Use PDO and use prepared statements. This step is crucial. Not only is it easier for the database to process, it's easier for you, and it yields the performance you desire (see the sketch after this list).

  • Measure the potential performance of the server you are using.

  • Optimize the hardware if possible - you can only do so much with the code. If you produce the best possible code in the universe, your constraining factor will be the hardware and no matter what kind of software you stick there, if a limit is hit - that's it. At this point you are looking either to increase the hardware processing power or to split the data into multiple shards.
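
As a starting point for the first bullet, a per-request insert with PDO and a prepared statement could look roughly like this (the DSN and credentials are placeholders, the request handling is simplified to plain superglobals, and clean_and_validate() is the validation step from the question):

    <?php

    // Minimal per-request sketch: PDO + a prepared statement instead of string
    // concatenation. DSN and credentials are placeholders; validation is assumed
    // to live elsewhere, as in the question's code.
    $pdo = new PDO('mysql:host=localhost;dbname=stats;charset=utf8mb4', 'user', 'pass', [
        PDO::ATTR_ERRMODE => PDO::ERRMODE_EXCEPTION,
    ]);

    $data = ($_SERVER['REQUEST_METHOD'] === 'POST') ? $_POST : $_GET;
    clean_and_validate($data);

    $stmt = $pdo->prepare("
        INSERT INTO `userStat` (`userToken`, `userPosition`, `app`, `build`)
        VALUES (?, ?, ?, ?)
    ");
    $stmt->execute([
        $data['userToken'],
        $data['userPosition'],
        $data['app'],
        $data['build'],
    ]);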

These are only a few tips, as you can see - this topic is pretty huge. However, I believe I gave you a few things to google about. Whatever you decide to do - good luck!

N.B.
  • I'm upvoting the valuable hardware tips. Yet, if I were you, I would consider the possibility of facing an X-Y problem here. As stated in my comment to the question, I believe the first thing we (and the OP) should ask ourselves is *what those data are going to be used for, how, how often, etc.* – LSerni Nov 09 '15 at 22:24
  • @lserni - you're correct in your statement, it's difficult to suggest the proper hammer if the nails are unknown. – N.B. Nov 09 '15 at 22:28
0
  • Do it properly; at least use the PDO interface to MySQL from PHP.
  • On suitable hardware (memory and disk speed), 800,000 hits to PHP and inserts into a MySQL server is not a problem. I'd expect to do multiples of that per hour on decent hardware and optimised servers.
Alister Bulman