I came across an article about Join decomposition.

SCENARIO #1 (Not good):

Select * from tag
Join tag_post ON tag_post.tag_id=tag.id
Join post ON tag_post.post_id=post.id
Where tag.tag='mysql'

SCENARIO #2 (good):

Select * from tag where tag='mysql'

Select * from tag_post Where tag_id=1234

Select * from post where post.id in (123,456,9098,545)

It was suggested to stick to scenario #2 for many reasons, especially caching. The question is how to perform the join inside our application. Could you give an example in PHP of joining the results after retrieving them individually? (I have read "MyISAM Performance: Join Decomposition?" but it did not help.)

Alireza

1 Answer

You COULD use an SQL subselect (if I understand your question). Using PHP would be rather odd when SQL already has all the capabilities.

SELECT *
FROM `post`
WHERE `id` IN (
    SELECT `post_id`
    FROM `tag_post`
    WHERE `tag_id` = (
        SELECT `id`
        FROM `tag`
        WHERE `tag` = 'mysql'
    )
)

I'm not sure what your database structure looks like, but this should get you started. It's pretty much SQL inception: a query within a query. You can select data using the result of a subselect.

Please, before copying this SQL and telling me it's not working, verify all table and column names.

Before anyone starts to cry about speed, caching and efficiency: I think this is rather efficient. Instead of selecting ALL data and looping through it in PHP, you can select just the smaller bits using native SQL, as it was meant to be used.

Again, I strongly discourage using PHP to pick out specific data. SQL is all you need.


edit: here's your script

Assuming you have some multi-dimensional arrays containing all data:

// dummy results

// table tag
$tags = array(
    // first record
    array(
        'id'    => 0,
        'tag'   => 'mysql'
    ), 
    // second record
    array(
        'id'    => 1,
        'tag'   => 'php'
    )
    // etc
);

// table tag_post
$tag_posts = array(
    // first record
    array(
        'id'        => 0,
        'post_id'   => 0,   // post #1
        'tag_id'    => 0    // has tag mysql
    ),
    // second record
    array(
        'id'        => 1,
        'post_id'   => 1,   // post #2
        'tag_id'    => 0    // has tag mysql
    ),
    // third record
    array(
        'id'        => 2,
        'post_id'   => 2,   // post #3
        'tag_id'    => 1    // has tag php
    )
    // etc
);

// table post
$posts = array(
    // first record
    array(
        'id'        => 0,
        'content'   => 'content post #1'
    ),
    // second record
    array(
        'id'        => 1,
        'content'   => 'content post #2'
    ),
    // third record
    array(
        'id'        => 2,
        'content'   => 'content post #3'
    )
    // etc
);

// searching for tag
$tag = 'mysql';
$tagid = -1;
$postids = array();
$results = array();

// first get the id of this tag
foreach($tags as $key => $value) {
    if($value['tag'] === $tag) {
        // set the id of the tag
        $tagid = $value['id'];

        // there's only one possible id, so we break the loop
        break;
    }
}

// get post ids using the tag id
if($tagid > -1) { // verify if a tag id was found
    foreach($tag_posts as $key => $value) {
        if($value['tag_id'] === $tagid) {
            // add post id to post ids
            $postids[] = $value['post_id'];
        }
    }
}

// finally get post content
if(count($postids) > 0) { //verify if some posts were found
    foreach($posts as $key => $value) {
        // check if the id of the post can be found in the posts ids we have found
        if(in_array($value['id'], $postids)) {
            // add all data of the post to result
            $results[] = $value;
        }
    }
}

If you look at the length of the script above, this is exactly why I'd stick to SQL.

Now, as I recall, you wanted to join using PHP rather than doing it in SQL. I know the script above is not a real join; it just collects results from some arrays. But building an actual join would only waste time and be less efficient than leaving the results as they are.
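If joined rows are really needed in the application, one common approach is a hash-style join: index one result set by its key, then look rows up while walking the link table. The following is only a minimal sketch under assumptions. The function name joinTagPosts and the shape of the output rows are hypothetical, and the input arrays stand in for the results of the three single-table queries.

```php
<?php
// Hypothetical rows, shaped like the results of the three single-table queries.
$tag = array('id' => 0, 'tag' => 'mysql');
$tagPosts = array(
    array('post_id' => 0, 'tag_id' => 0),
    array('post_id' => 1, 'tag_id' => 0),
);
$posts = array(
    array('id' => 0, 'content' => 'content post #1'),
    array('id' => 1, 'content' => 'content post #2'),
);

// Hypothetical helper: combines one tag row, its link rows and the post rows
// into flat rows, similar to what the SQL join would have returned.
function joinTagPosts(array $tag, array $tagPosts, array $posts)
{
    // Index posts by id so each lookup is a single array access
    // instead of a scan over all posts.
    $postsById = array();
    foreach ($posts as $post) {
        $postsById[$post['id']] = $post;
    }

    $joined = array();
    foreach ($tagPosts as $link) {
        // Cast to int because database drivers may return numeric columns as strings.
        if ((int) $link['tag_id'] !== (int) $tag['id']) {
            continue; // link belongs to another tag
        }
        if (!isset($postsById[$link['post_id']])) {
            continue; // post was not fetched: drop the row, like an inner join
        }
        $post = $postsById[$link['post_id']];
        $joined[] = array(
            'tag_id'  => $tag['id'],
            'tag'     => $tag['tag'],
            'post_id' => $post['id'],
            'content' => $post['content'],
        );
    }

    return $joined;
}

$rows = joinTagPosts($tag, $tagPosts, $posts);
```

Indexing the posts first keeps the work roughly proportional to the number of rows, whereas calling in_array() inside a loop rescans the id list for every post.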


edit: 21-12-12 as result of comments below

I've done a little benchmark and the results are quite stunning:

DATABASE RECORDS:
tags:           10
posts:          1000
tag_posts:      1000 (every post has 1 random tag)

Selecting all posts with a specific tag resulted in 82 records.

SUBSELECT RESULTS:
run time:                        0.772885084152
bytes downloaded from database:  3417

PHP RESULTS:
run time:                        0.086599111557
bytes downloaded from database:  48644



Please note that the benchmark had both the application and the database on the
same host. If you use different hosts for the application and the database layer,
the PHP result could end up taking longer, because sending data between two hosts
naturally takes much more time than when they're on the same host.

Even though the subselect returns about 14 times less data, the request takes roughly 9 times longer...
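For anyone who wants to repeat such a comparison, a measurement harness could look roughly like the sketch below. This is not the script used for the numbers above. The function name measure is hypothetical, the closure is a stand-in for a real database call, and the length of the JSON-encoded result is only a rough proxy for the bytes actually sent by the database.

```php
<?php
// Hypothetical helper: runs $fetch once and reports elapsed wall-clock
// seconds, the row count and an approximate payload size.
function measure(callable $fetch)
{
    $start = microtime(true);
    $rows = $fetch();
    $elapsed = microtime(true) - $start;

    return array(
        'seconds' => $elapsed,
        'rows'    => count($rows),
        // Assumption: JSON length approximates the transferred data size.
        'bytes'   => strlen(json_encode($rows)),
    );
}

// Stand-in for a database call; replace the closure body with the real query.
$stats = measure(function () {
    return array(
        array('id' => 0, 'content' => 'content post #1'),
        array('id' => 1, 'content' => 'content post #2'),
    );
});

printf("%d rows, %d bytes, %.6f s\n", $stats['rows'], $stats['bytes'], $stats['seconds']);
```

A single run is noisy, so repeating each variant several times and comparing averages gives a fairer picture.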

I NEVER expected these results, so I'm convinced. I will certainly use this information when I know that performance is important, but I will still use SQL for smaller operations hehe...

Tim S.
  • tnx dude, I know about subselects, but the article's point was to join different selects inside your app rather than joining them with MySQL! – Alireza Dec 20 '11 at 10:25
  • why would you want to do it in your app when SQL provides all the tools you need? – Tim S. Dec 20 '11 at 10:30
  • **High performance MySQL** book says: You can decompose a join by running multiple single-table queries instead of a multitable join, and then performing join in the application! – Alireza Dec 20 '11 at 10:37
  • I think this is it. Although it is much longer than join, we do not need to go from app layer to Database layer. worth it. – Alireza Dec 20 '11 at 12:19
  • Even though I'd love to try to convince you that using an engine that natively has that feature is better, rather than trying to make your own in another engine, I'm still glad I could help hehe ;) – Tim S. Dec 20 '11 at 15:11
  • In my view we should do everything we can in the app layer. The DB layer is a step further away than the app layer, and to go there you need more CPU cycles, memory, and especially network traffic. This cost is huge, really huge. You think your code is more expensive just because it is longer than a simple join, but from the system's perspective your code will run in the blink of an eye. Well, I hope you're convinced now. – Alireza Dec 20 '11 at 15:25
  • I'm very sorry to disappoint you, but sending 3 entire tables back to the app layer requires more network traffic than sending the results processed by the database layer... However I can't speak about CPU cycle or memory. Time for a benchmark! – Tim S. Dec 21 '11 at 07:54
  • I've edited my answer with a benchmark with surprising results. No need to convince me anymore! – Tim S. Dec 21 '11 at 11:16