5

I have these two MySQL tables: TableA and TableB.

TableA
* ColumnAId
* ColumnA1
* ColumnA2
TableB
* ColumnBId
* ColumnAId
* ColumnB1
* ColumnB2

In PHP, I want to build this multidimensional array format:

$array = array(
    array(
        'ColumnAId' => value,
        'ColumnA1' => value,
        'ColumnA2' => value,
        'TableB' => array(
            array(
                'ColumnBId' => value,
                'ColumnAId' => value,
                'ColumnB1' => value,
                'ColumnB2' => value
            )
        )
    )
);

so that I can loop over it like this:

foreach($array as $i => $TableA) {
    echo 'ColumnAId' . $TableA['ColumnAId'];
    echo 'ColumnA1' . $TableA['ColumnA1'];
    echo 'ColumnA2' . $TableA['ColumnA2'];
    echo 'TableB\'s';
    foreach($TableA['TableB'] as $j => $TableB) {
        echo $TableB['...']...
        echo $TableB['...']...
    }
}

My problem is: what is the best or proper way of querying the MySQL database to achieve this?

Solution 1 --- the one I'm using

$array = array();
$rs = mysqli_query($con, "SELECT * FROM TableA");
while ($row = mysqli_fetch_assoc($rs)) {
    $rs2 = mysqli_query($con, "SELECT * FROM TableB WHERE ColumnAId=" . $row['ColumnAId']);
    $array2 = array();
    while ($row2 = mysqli_fetch_assoc($rs2)) {
        $array2[] = $row2;
    }
    $row['TableB'] = $array2;
    $array[] = $row;
}

I'm doubting my code because it queries the database over and over.

Solution 2

$rs = mysqli_query("SELECT * FROM TableA JOIN TableB ON TableA.ColumnAId=TableB.ColumnAId");
while ($row = mysqli_fet...) {
    // Code
}

The second solution only queries once, but if I have thousands of rows in TableA and a thousand rows in TableB for each TableA.ColumnAId (1 TableA row = 1000 TableB rows), won't this solution take much more time than Solution 1?

Glitch Desire
iSa
  • I assume that in solution 1, you do use a WHERE clause in the second query to only fetch rows from TableB related to the specific row in TableA? In that case, I personally would prefer solution 1, unless someone else responds with a better idea/analysis. – Tomas Creemers Aug 16 '13 at 13:14
  • @TomasCreemers - no, this is a very bad practice, to use cycled queries (see answer below) – Alma Do Aug 16 '13 at 13:15
  • @TomasCreemers Doing queries in loops is basically rule #2 of 'What not to do with SQL', behind inserting unsanitized user inputs into queries. – Glitch Desire Aug 16 '13 at 13:31
  • @SweetieBelle ... which would be one of the bad rules. There's a lot of reasons why using multiple queries can be better/faster than using a single query. Lock granularity is one, the MySQL optimizer being bad is another, temporary table overhead is still another one, and I could probably name a few others. – griffin Aug 16 '13 at 13:52
  • @griffin Lock granularity is unlikely to be an issue when doing a `SELECT` statement for reasons we have already discussed. Most likely, you won't want the database to change whilst you're reading data (for relational integrity reasons) so it will be better for any write operations to wait. It's **incredibly unlikely** that any of the other factors will add up to the same performance overhead as the cost of hundreds or thousands of connections, executions and responses, even over a UNIX socket. – Glitch Desire Aug 16 '13 at 14:27
  • @SweetieBelle Unfortunately I can't edit my comment anymore, but I would still suggest benchmarking in any case, as the numbers you get might drastically change as soon as you either do outer joins (left/right) or the query complexity increases (it's not uncommon on complex projects to have 6 or more joins in a query especially after normalizing tables for better memory usage) – griffin Aug 16 '13 at 14:39
  • I stopped reading after seeing that TableB has 2 id columns –  Aug 16 '13 at 15:03
  • @2astalavista Clearly you don't understand compound `PRIMARY KEY` constraints (usually a `PRIMARY KEY` constraint over two or more `FOREIGN KEY`s in a weak entity) then. – Glitch Desire Aug 16 '13 at 15:13

4 Answers

6

Neither of the two proposed solutions is likely to be optimal, BUT solution 1 is UNPREDICTABLE and thus INHERENTLY FLAWED!

One of the first things you learn when dealing with large databases is that 'the best way' to do a query is often dependent upon factors (referred to as meta-data) within the database:

  • How many rows there are.
  • How many tables you are querying.
  • The size of each row.

Because of this, there's unlikely to be a silver-bullet solution for your problem. Your database is not the same as my database; you will need to benchmark different optimizations if you need the best performance available.

You will probably find that building and applying the correct indexes (and understanding the native implementation of indexes in MySQL) in your database does a lot more for you.
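
For example (illustrative only: it assumes the join column from the question and a made-up index name), an index on the foreign-key column of TableB is usually the first thing to check for this kind of parent/child query:

ALTER TABLE TableB ADD INDEX idx_tableb_columnaid (ColumnAId);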

There are some golden rules with queries which should rarely be broken:

  • Don't do them in loop structures. As tempting as it often is, the overhead on creating a connection, executing a query and getting a response is high.
  • Avoid SELECT * unless needed. Selecting more columns will significantly increase overhead of your SQL operations.
  • Know thy indexes. Use the EXPLAIN feature so that you can see which indexes are being used, optimize your queries to use what's available and create new ones.

Because of this, of the two I'd go for the second query (replacing SELECT * with only the columns you want), but there are probably better ways to structure the query if you have the time to optimize.
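
As a rough, unbenchmarked sketch (reusing the table/column names and the $con connection from the question), the single JOIN result can be grouped back into the nested array format the question asks for like this:

$sql = "SELECT a.ColumnAId, a.ColumnA1, a.ColumnA2,
               b.ColumnBId, b.ColumnB1, b.ColumnB2
        FROM TableA a
        JOIN TableB b ON b.ColumnAId = a.ColumnAId";
$rs = mysqli_query($con, $sql);

$array = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $id = $row['ColumnAId'];
    if (!isset($array[$id])) {
        // First time we see this TableA row: copy its columns and prepare the TableB list
        $array[$id] = array(
            'ColumnAId' => $row['ColumnAId'],
            'ColumnA1'  => $row['ColumnA1'],
            'ColumnA2'  => $row['ColumnA2'],
            'TableB'    => array(),
        );
    }
    // Every joined row contributes one TableB entry
    $array[$id]['TableB'][] = array(
        'ColumnBId' => $row['ColumnBId'],
        'ColumnAId' => $row['ColumnAId'],
        'ColumnB1'  => $row['ColumnB1'],
        'ColumnB2'  => $row['ColumnB2'],
    );
}
$array = array_values($array); // re-index numerically to match the format in the question

A LEFT JOIN (plus a check for a NULL ColumnBId) would additionally keep TableA rows that have no TableB rows, if you need those.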

However, speed should NOT be your only consideration in this, there is a GREAT reason not to use suggestion one:

PREDICTABILITY: why read-locks are a good thing

One of the other answers suggests that having the table locked for a long period of time is a bad thing, and that therefore the multiple-query solution is good.

I would argue that this couldn't be further from the truth. In fact, I'd argue that in many cases the predictability of running a single locking SELECT query is a greater argument FOR running that query than the optimization & speed benefits.

First of all, when we run a SELECT (read-only) query on a MyISAM or InnoDB database (default systems for MySQL), what happens is that the table is read-locked. This prevents any WRITE operations from happening on the table until the read-lock is surrendered (either our SELECT query completes or fails). Other SELECT queries are not affected, so if you're running a multi-threaded application, they will continue to work.

This delay is a GOOD thing. Why, you may ask? Relational data integrity.

Let's take an example: we're running an operation to get a list of items currently in the inventory of a bunch of users in a game, so we do this join:

SELECT * FROM `users` JOIN `items` ON `users`.`id`=`items`.`inventory_id` WHERE `users`.`logged_in` = 1;

What happens if, during this query operation, a user trades an item to another user? Using this query, we see the game state as it was when we started the query: the item exists once, in the inventory of the user who had it before we ran the query.

But, what happens if we're running it in a loop?

Depending on whether the user traded it before or after we read his details, and in which order we read the inventory of the two players, there are four possibilities:

  1. The item could be shown in the first user's inventory (scan user A -> scan user B -> item traded OR scan user B -> scan user A -> item traded).
  2. The item could be shown in the second user's inventory (item traded -> scan user A -> scan user B OR item traded -> scan user B -> scan user A).
  3. The item could be shown in both inventories (scan user A -> item traded -> scan user B).
  4. The item could be shown in neither of the user's inventories (scan user B -> item traded -> scan user A).

What this means is that we would be unable to predict the results of the query or to ensure relational integrity.

If you're planning to give $5,000 to the guy with item ID 1000000 at midnight on Tuesday, I hope you have $10k on hand. If your program relies on unique items being unique when snapshots are taken, you will possibly raise an exception with this kind of query.

Locking is good because it increases predictability and protects the integrity of results.

Note: You could force a loop to lock with a transaction, but it will still be slower.
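
For illustration only (this assumes InnoDB, since MyISAM has no transactions, and reuses the $con connection from the question), one way that might look:

// Run the whole loop inside a transaction; appending LOCK IN SHARE MODE to each
// SELECT makes the rows read stay unmodifiable by other sessions until COMMIT.
mysqli_query($con, "START TRANSACTION");
// ... Solution 1 loop here, e.g.:
//     SELECT * FROM TableB WHERE ColumnAId = ... LOCK IN SHARE MODE
mysqli_query($con, "COMMIT");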

Oh, and finally, USE PREPARED STATEMENTS!

You should never have a statement that looks like this:

mysqli_query("SELECT * FROM Table2 WHERE ColumnAId=" . $row['ColumnAId'], $con);

mysqli has support for prepared statements. Read about them and use them; they will help you avoid something terrible happening to your database.
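
As a minimal sketch in procedural mysqli style (reusing the column names and $con from the question; note that mysqli_stmt_get_result() requires the mysqlnd driver), the inner query could look like this:

$columnAId = (int) $row['ColumnAId'];
$stmt = mysqli_prepare($con, "SELECT ColumnBId, ColumnAId, ColumnB1, ColumnB2 FROM TableB WHERE ColumnAId = ?");
mysqli_stmt_bind_param($stmt, 'i', $columnAId); // 'i' assumes ColumnAId is an integer
mysqli_stmt_execute($stmt);
$result = mysqli_stmt_get_result($stmt);
while ($rowB = mysqli_fetch_assoc($result)) {
    // ... use $rowB ...
}
mysqli_stmt_close($stmt);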

Glitch Desire
  • 1
    On a sidenote: If you want to put more of the intersection etc logic into your application (PHP) code, you might want to consider using a different storage solution like a key-value store ([Tokyo Cabinet](http://fallabs.com/tokyocabinet/), [CouchDB](http://couchdb.apache.org/) etc) – griffin Aug 16 '13 at 14:30
  • @griffin Have updated my answer with why locking is inherently good, as I feel like your answer makes locking seem like an altogether bad thing (intentional or otherwise). – Glitch Desire Aug 16 '13 at 14:48
  • _locking_ is inherently BAD - atomicity is GOOD. You're confusing a positive side effect of locking being "good" with locking itself being good. Or not, but that's what I read from your updated answer. By the way that's also why lock free (or even wait free) algorithms win over locking algorithms most of the time (depending on overhead for the non-locking solution this might not hold of course) – griffin Aug 16 '13 at 14:57
  • @griffin No, locking protects integrity and ensures predictability. It has negative side effects WRT performance but (tuple, rather than table) locking is good. – Glitch Desire Aug 16 '13 at 14:59
  • 1
    Well which of those you see as the side effect depends on perspective, which seems to be different for me, coming from multithreaded programming in C (where I always strive for lock-free, wait-free solutions for increased performance) - e.g. `__sync_fetch_and_add` is "better" than a mutex init+lock+unlock+destroy in _most_ situations (ofc only if it's possible). Anyway, I'd at least reword it to "Locking _in MySQL_ is good ..." – griffin Aug 16 '13 at 15:03
  • Prepared statements have been proven to be slower than queries if only used once in _many_ benchmarks already, so if you don't want to provide additional info, you should probably prefer linking to an existing stackoverflow discussion regarding this instead of posting some generic statement. Also, there's no security problem with using string queries - the problem comes from inserting unsafe data into your sql statement, _which you can still do even with prepared statements!_ – griffin Aug 16 '13 at 15:10
  • @griffin I take your point: read operations preventing write operations in MySQL (or elsewhere, where data integrity and predictability is important) is good. – Glitch Desire Aug 16 '13 at 15:15
  • @griffin If the difference between a prepared statement and a standard query is likely to make the performance unacceptable, MySQL isn't fit for the job and you probably need an Oracle license. – Glitch Desire Aug 16 '13 at 15:16
  • If performance is of major concern I would use neither and use a C servlet webserver with inbuilt key-value storage, not having to pay horrible Oracle licenses. But that's beside the point. My argument is not about absolute performance, but relative performance with the given tools (PHP+MySQL), which you often may not choose yourself in business. And in that case string-based queries are faster (and less complex to use as well). And the most important point to me: People often use PS and then concatenate unchecked user input into the query, thinking that PS would secure them anyway. – griffin Aug 16 '13 at 15:23
  • To me it's more important to teach people why something would be insecure, instead of throwing a tool at them and telling them "you're safe if you use this". – griffin Aug 16 '13 at 15:24
  • @griffin Totally agree, it's part of why I take exception to the 'don't use `mysql_*`' spam when we pass them PDO or mysqli and they continue making the same mistakes. Having said that, prepared statements are far safer (even with unchecked input, the worst you should end up with is junk in your database since MySQL fixed the charset issues) and I feel like my link to bobby-tables.com does explain why concatenating is unsafe. – Glitch Desire Aug 16 '13 at 15:26
  • I probably worded my comment bad. Think about `$stmt = $dbh->prepare("INSERT INTO " .$_GET['unsafe'] . " (name, value) VALUES (:name, :value)");` That's what I was talking about. – griffin Aug 16 '13 at 15:32
  • @griffin Sure, I take your point, but the issue here is concatenation still and I feel like it's explained by the link. :P – Glitch Desire Aug 16 '13 at 15:33
2

Definitely the second way. A nested query is an ugly thing since you pay all the query overheads (execution, network, etc.) every time for every nested query, while a single JOIN query is executed once - i.e. all overheads are paid only once.

The simple rule is not to use queries in loops - in general. There can be exceptions: if a single query becomes too complex, it may need to be split for performance, but whether that applies in a particular case can only be shown by benchmarks and measurements.

Alma Do
  • While I agree that SQL was definitely made to do more complex, structured queries, real world numbers suggest that it sometimes is actually better to split up queries (lock contention, cacheability, ...). As your answer is the "general accepted truth" though, I'm not going to downvote it, but one should keep in mind that generic answers don't substitute real knowledge of specifics or benchmarking. – griffin Aug 16 '13 at 14:03
  • @griffin - yes, maybe my answer is too general. All that you've said is true (and I'm aware of that) - so I've added that to the answer, thank you. – Alma Do Aug 16 '13 at 14:06
2

If you want to do algorithmic evaluation of your data in your application code (which I think is the right thing to do), you should not use SQL at all. SQL was made to be a human-readable way to ask a relational database for computed results, which means that if you only use it to store data and do the computations in your code, you're doing it wrong anyway.

In such a case you should prefer a different storage/retrieval option such as a key-value store (there are persistent ones out there, and newer versions of MySQL expose a key-value interface for InnoDB as well, but that is still using a relational database for key-value storage, i.e. the wrong tool for the job).

If you STILL want to use your solution:

Benchmark.

I've often found that issuing multiple queries can be faster than a single query, because MySQL has to parse less query text, the optimizer has less work to do, and more often than not the MySQL optimizer simply fails (that's the reason things like STRAIGHT_JOIN and index hints exist). And even if it does not fail, multiple queries might still be faster depending on the underlying storage engine as well as how many threads try to access the data at once (lock granularity - this only applies when update queries are mixed in, though - neither MyISAM nor InnoDB locks the whole table for SELECT queries by default). Then again, you might even get different results with the two solutions if you don't use transactions, as data might change between queries if you use multiple queries rather than a single one.
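
Purely to illustrate the syntax mentioned above (the index name is hypothetical), STRAIGHT_JOIN forces MySQL to read the tables in the order written, and USE INDEX hints at a specific index:

SELECT a.ColumnAId, b.ColumnBId, b.ColumnB1
FROM TableA a
STRAIGHT_JOIN TableB b USE INDEX (idx_tableb_columnaid)
    ON b.ColumnAId = a.ColumnAId;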

In a nutshell: There's way more to your question than what you posted/asked for, and what a generic answer can provide.

Regarding your solutions: I'd prefer the first solution if you have an environment where a) data changes are common and/or b) many concurrently running threads (requests) access and update your tables (lock granularity is better with split-up queries, as is cacheability of the queries); if your database is on a different network, i.e. network latency is an issue, you're probably better off with the second solution (but most people I know have MySQL on the same server, using socket connections, and local socket connections normally don't have much latency).

Situation may also change depending on how often the for loop is actually executed.

Again: Benchmark


Another thing to consider is memory efficiency and algorithmic efficiency. The latter is about O(n) in both cases, but depending on the type of data you use to join, it could be worse in either of the two. E.g. if you use strings to join (you really shouldn't, but you don't say), performance in the more PHP-dependent solution also depends on PHP's hash-map algorithm (arrays in PHP are effectively hash maps) and the likelihood of a collision, while MySQL string indexes are normally fixed length, and thus, depending on your data, might not be applicable.

For memory efficiency, the multi-query version is certainly better, as you have the PHP array anyway (which is very inefficient in terms of memory!) in both solutions, but the join might use a temporary table depending on several circumstances (normally it shouldn't, but there ARE cases where it does - you can check using EXPLAIN for your queries).
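
A quick way to check that (using the tables from the question) is to look at the Extra column of the EXPLAIN output for "Using temporary" or "Using filesort":

EXPLAIN SELECT a.ColumnAId, b.ColumnBId
FROM TableA a JOIN TableB b ON b.ColumnAId = a.ColumnAId;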

griffin
  • -1: Too long for a comment, but you didn't really answer the question. – Glitch Desire Aug 16 '13 at 13:49
  • @SweetieBelle It's not a comment, and it DOES answer the question. "Benchmark" is the answer, and I also state different advantages of multiple queries vs a single query, so it's just a more in-depth answer than the other ones. – griffin Aug 16 '13 at 13:50
  • Benchmark is not an answer, it's a suggestion for how to get an answer. Unfortunately, in this case the answer is pretty obvious: the loop of queries will be slower in all cases. – Glitch Desire Aug 16 '13 at 13:51
  • @SweetieBelle ... which is not true. If you have multiple threads accessing the data, lock granularity is better if you split the queries, as is cacheability. So without knowing the rest of the OPs setup, the question is not possibly answerable with a "Solution 1 is faster in your case" or "Solution 2 is faster in your case". – griffin Aug 16 '13 at 13:54
  • There is no real world situation where submitting 200 queries is faster than submitting 1 query which does the same workload. This is a read-only operation so a deadlock is not going to occur (in MySQL read-locks are not exclusive). – Glitch Desire Aug 16 '13 at 13:55
  • @SweetieBelle I've built several CMS using PHP and MySQL in the past (thank god I don't need to use those "technologies" anymore), and there have been VERY mixed results regarding that. For locks see e.g. http://stackoverflow.com/questions/917640/any-way-to-select-without-causing-locking-in-mysql . Also, to mitigate your point in being "no real answer", I've added what I would prefer at the end of the post. – griffin Aug 16 '13 at 14:00
  • (assuming InnoDB) `SELECT` queries create a **non-exclusive read-lock** (to be more accurate, `READ COMMITTED`, by default) like I said. You can run 100 simultaneous threads which require read locks and they won't prevent each other from operating. There will be a potential conflict if you try to edit tuples which are read-locked during the operation, but that is *probably* desired behaviour given what the OP is trying to accomplish. Putting queries in a loop runs possible risk of relational integrity issues from data being edited inbetween loop cycles. – Glitch Desire Aug 16 '13 at 14:06
  • @SweetieBelle And with MyISAM, which has been the default engine in MySQL for very long, you get a table lock on a select query, so other queries ARE blocked. Of course, in that case the split solution would still block the whole secondary table with every query, but 1. the primary table would be free again and 2. each individual query would be shorter, and other queries could even be run between individual queries. Your comment about integrity is of course true, but I already stated that in my answer. – griffin Aug 16 '13 at 14:11
  • @SweetieBelle Anyway with the data provided by the OP, all of our answers amount to either "general accepted truths" or "experience", while the hard facts are only part of the explanations, so without knowing circumstances it's probably best to provide as much information as possible, which I'm trying to do with my answer. – griffin Aug 16 '13 at 14:13
  • MyISAM allows concurrent `SELECT` queries. I don't know of any database that by default blocks concurrent read requests -- there's no logical reason to do so. MyISAM does table-locking instead of row/tuple locking, which would mean that *any* `SELECT` query will delay any write operations to the same table, but this is **probably desired behaviour** for the sake of maintaining relational integrity. – Glitch Desire Aug 16 '13 at 14:17
  • To be clear, if this were a write operation, I think your arguments would have merit. Submitting an eight-minute write query with MyISAM would be an issue and it may be better to break it up into smaller queries (I still wouldn't do this in a PHP loop). – Glitch Desire Aug 16 '13 at 14:19
  • @SweetieBelle Seems you're right on the read locks regarding MyISAM, I had false information there. Thank you for helping me learn yet another thing (I've not been using MyISAM for years, as it doesn't support transactions). Updated my answer to reflect behaviour regarding updates/selects. – griffin Aug 16 '13 at 14:23
  • I will remove my -1, but I don't agree with your conclusions regarding multiple queries. Any benchmark will show that even using a socket connection, it's vastly inferior especially when accounting for the fairly slow `mysqli` driver. – Glitch Desire Aug 16 '13 at 14:28
  • 1
    @SweetieBelle: is that still true if (for the sake of argument) TableA in this example would contain a BLOB column? Or the case where there are *many* columns in TableA you need and per row in TableA there are *many* rows in TableB, wouldn't that mean a *lot* of duplicate data being part of the result set? I have benchmarked several query scenarios where a loop gives better performance than a single, more complex query. – Tomas Creemers Aug 16 '13 at 14:38
  • @TomasCreemers If you don't need the `BLOB` column then don't select it. If you have duplicate data, select it once. You'd have to present the tables for us to know the ideal solution but you'll never find that a PHP loop achieves better results for getting the same data than subquerying in SQL. – Glitch Desire Aug 16 '13 at 15:02
  • 1
    @SweetieBelle subqueries aren't the same as JOINS, and most of the time require a temporary table, so your statement may not hold true for large amounts of data (performance normally also decreases with more memory usage, as then there is less memory free for caching or keeping tables in memory). – griffin Aug 16 '13 at 15:05
  • @TomasCreemers Read my answer for a better reason not to use a loop. – Glitch Desire Aug 16 '13 at 15:09
  • @griffin If memory is your major constraint in any application, you're doing it wrong. Disk IO and CPU are *far* more likely to be a major consideration than RAM for any query. Even for a 1,000,000 record table with each record at 1KB, that's only 1GB RAM in use if you throw the whole table in RAM. – Glitch Desire Aug 16 '13 at 15:11
  • @SweetieBelle I'd actually say it's the other way around - if you're CPU bound, your code is inefficient (yes, I know PHP is painfully slow), and if you're IO bound you're accessing the disk too much, which you shouldn't. And if you build for projects encompassing 300+ websites, each one with at least 5k new articles + other stuff (events, static pages, ...) + imported data (flickr, youtube, ...), you will hit the 1GB REALLY FAST. Now if you want to be able to serve >1k visitors per second in hot phases, reading table data from disk is a fail (if you don't own several $k SSD raids). – griffin Aug 16 '13 at 15:17
  • @SweetieBelle + up to about 128GB memory is way cheaper than enterprise grade SSD raids (where the SSDs cost > 1k per piece). – griffin Aug 16 '13 at 15:18
  • @griffin By definition PHP is CPU inefficient. MySQL isn't exactly CPU efficient either when compared with enterprise-grade alternatives. Most applications will be reading SQL from either memory (in which case we can assume there's a lot of it) or 15kRPM SAS drives (in which case IO is a relatively fast bottleneck, but still a bottleneck). – Glitch Desire Aug 16 '13 at 15:20
  • If you want to be able to serve 15k+ visitors per second, I'd suggest caching your query every few seconds/minutes instead of running it for every hit, because no matter how efficient you make it, you're going to max out on memory building a multiple GB PHP array (required in either case, and shockingly memory inefficient) for every hit. – Glitch Desire Aug 16 '13 at 15:21
  • To be frank, if you're serving 15k hits/sec I'd suggest not using interpreted code and using Java or C++ instead to write your back-end. Also ditching MySQL for Oracle or even Mongo/friends. But that's another discussion altogether. – Glitch Desire Aug 16 '13 at 15:24
  • @SweetieBelle At least finally we agree ;) But I still have to nitpick there (sorry): There's no reason to require a PHP array for the data, you can just "render" the data as you fetch it. And I'd still prefer C servlets with [GWAN](http://gwan.com/) and its builtin KV Store over some slow VM like Java and pricey Oracle solution or the like. – griffin Aug 16 '13 at 15:27
  • 1
    @SweetieBelle: in my example the point is, of course, that the BLOB column *is* required, so it needs to be part of the SELECT. The problem, however, is that MySQL cannot just send the BLOB once, it must send it for EACH row in the joined table. If the BLOB is say 0.5MByte and there are 10 000 rows in the joined table that match per row in the parent table, this is an extra transfer of almost 5 GByte. This overhead will be much *larger* than that of one extra query per parent table row. – Tomas Creemers Aug 16 '13 at 15:28
  • @TomasCreemers Yes, if you don't optimize, but you could always omit the blob, then `SELECT blob FROM TableA`, then combine those blobs into your first array. This is 2 queries instead of however many parent rows you have and doesn't require you to compromise predictability. – Glitch Desire Aug 16 '13 at 15:30
  • @SweetieBelle And now you're talking about using multiple queries vs a single one - and if you extend that example to one with multiple joins, where the BLOB column is not in one of the tables which only have one row of interest, you already have several queries. Also, you should use transactions in that case, which naturally proves the point you made regarding logical integrity. – griffin Aug 16 '13 at 15:35
  • @griffin Yes, if you optimize all the way you may end up with multiple queries (or sub-queries) but there is *very rarely* an excuse to use a PHP loop with a query in it (I can't think of any example where this is right). – Glitch Desire Aug 16 '13 at 15:36
0

In some cases, you should use LIMIT for best performance.

If you want to show 1000 rows plus some master data from a single query,

you should fetch the 1000 rows with a LIMIT of between 10 and 100 per page.

Then get the foreign keys to the master data and fetch the master data with a single query using WHERE ... IN, limited to the count of unique keys you collected.

Example

SELECT productID, date FROM transaction_product LIMIT 100

Get all productID values and make them unique.

Then:

SELECT price FROM master_product WHERE productID IN (1, 2, 3, 4) LIMIT 4  -- count from the total unique IDs

foreach (transaction) master_product[productID]
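
A rough PHP sketch of that idea (the table and column names follow the example above; the price column and the $con connection are assumed):

$rs = mysqli_query($con, "SELECT productID, date FROM transaction_product LIMIT 100");
$transactions = array();
$uniqueIds = array();
while ($row = mysqli_fetch_assoc($rs)) {
    $transactions[] = $row;
    $uniqueIds[$row['productID']] = true; // collect unique product IDs
}

if (!empty($uniqueIds)) {
    // One query for all the master data, limited to the unique IDs collected above
    $idList = implode(',', array_map('intval', array_keys($uniqueIds)));
    $rs2 = mysqli_query($con, "SELECT productID, price FROM master_product WHERE productID IN ($idList)");
    $prices = array();
    while ($row = mysqli_fetch_assoc($rs2)) {
        $prices[$row['productID']] = $row['price'];
    }

    foreach ($transactions as $transaction) {
        echo $transaction['date'] . ': ' . $prices[$transaction['productID']];
    }
}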

  • This answer is confusing. Can you put your code into code blocks and explain better what you are suggesting? – Mike Mar 03 '17 at 19:18
  • I tried it using PHP – helmi susanto Oct 19 '17 at 07:26
  • i try to using PHP and codeigniter simple way $this->db->limit(10); //try to use pagination for optimize result $query1 = $this->db->get("product_transaction"); foreach($query1->result() as $row) $uniq_ID [$row->ID]= $row->name; $this->db->limit(array_unique(array_keys($uniq_ID))); $this->db->where_in(array_unique(array_keys($uniq_ID))); $query2 = $this->db->get("product_master"); foreach($query2->result() as $row) echo $uniq_ID[$row->ID] – helmi susanto Oct 19 '17 at 07:32