JOIN vs SELECT IN when querying arbitrary related tables

Question

When extracting data from multiple tables for mapping into a GraphQL result, which is more effective?

Pseudocode below.

Version 1 - Joining N tables together and parsing them into separate objects on the caller's side

alldata = SELECT a.*, b.*, c.*
    FROM aaaa a
    LEFT OUTER JOIN bbbb b on a.id = b.parent_id
    LEFT OUTER JOIN cccc c on b.id = c.parent_id
    WHERE a.name = 'my name';
/* Map flat DB rows to Java pojo-s */
aObj = parseRowsIntoJavaObjects(alldata)

Version 2 - Performing selects on tables 1 by 1, limiting them by parent row id-s

a_rows = SELECT a.* FROM aaaa a WHERE a.name = 'my name';
b_rows = SELECT b.* FROM bbbb b WHERE b.parent_id IN (a_rows.id);
c_rows = SELECT c.* FROM cccc c WHERE c.parent_id IN (b_rows.id);
/* Map flat DB rows to Java pojo-s */
bObj = parseRowsIntoJavaObjects(b_rows, c_rows);
aObj = parseRowsIntoJavaObjects(a_rows, bObj);

The tables are connected using foreign keys and can feature 1:1, 1:N and N:N relationships. Their structure is otherwise not defined - the solution has to work against arbitrary datamodels. Indexes can be added as needed and it can be assumbed that any necessary indexes are present for the query execution.

Asking because there can be any number of tables joined in this manner and I am concerned if Version 1 causes a problem for having to return the same data multiple times in several rows due to how many-to-one relationships work.

I know making several queries is generally a bad idea if things can be managed with one and IN() is not particularly fast. Is using IN here still a bad idea?

Since the solution has to be generic and apply to any number of datamodels, I don't want to just try it out and optimize it for a specific model like most questions dealing with JOIN... WHERE IN().. do. I would like to ask for a suggestion of what is the generally more performant way of loading relational data from arbitrary hierarchies up to.. say 3 levels deep?

The only sort-of matching question I found on SO dealt with a slightly different case but advocated both somewhat similar solutions in its answers: Select from multiple tables - One to Many relation

Database: MariaDB

It strongly depend from the amount of data and the indexes existing on the tables. As a general rule, JOINs performs better than INs because they can rely better on the existing indexes. INs are converted in OR instructions, so they are quite slow if the number of records increase significantly. — Nicola Lepetit, Jan 14 '20 at 08:52
@JoakimDanielson Yes, that can be done for tables at teh second level, but what about the 3rd level? Children of children? — Tarmo, Jan 14 '20 at 09:24
This is a duplicate of a zillion question about join vs where-in or exists. Which are special cases of a zillion questions re SQL "performance". An SQL query describes a result & not an algorithm. We can't tell you what is "best" for your details. Please ask a non-duplicate researched question. PS This is not clear. Explain "parseDataIntoObjects" clearly. Use enough words, sentences & references to examples. PS What constraints hold is not needed to query. What is necessary & sufficient is what a row in each table (base & result) says about the business situation in terms of column values. — philipxy, Jan 14 '20 at 09:24
@NicolaLepetit any indexes can be added to the table depending on which queries are actually used. You can assume the indexes are applied optimally (however realistic that is). The amount of data is exactly my worry - the queries need to support substantial data amounts but in those cases they would always be paginated — Tarmo, Jan 14 '20 at 09:25
@philipxy I am sorry but I did not find any answers which matched my question in about 30 minutes of searching. Perhaps I do not have enough knowledge of the subject matter to understand that some question is ACTUALLY the very same thing I asked. To me this seemed different as this is not intended for a specific datamodel but an approach to a generic solution. If you know of any matching aswer I would be very thankful if you would link it. I would be happy to remove my question in that case. — Tarmo, Jan 14 '20 at 09:30
I thought you were asking about join vs where-in but now I can't see what your title has to do with your question or what your question even is. Constraints do not determine what the queries must be. You seem to think something like that FKs allow us to reconstruct objects--this is just not so in general--as in, "1:1, 1:N and N:N". You seem to suffer from a common misconception. Explain what is on the table & your assumptions & what you are trying to achieve. We might be able to help with some special case you are assuming about designs. But as I say, constraints don't determine querying. — philipxy, Jan 14 '20 at 09:35
@philipxy Tables can have any data and any fields. The solution should work with different datamodels. I make no assumptions about the datamodel except that relations between tables are defined using FK-s. I do not think FK-s construct objects. They define relations between data in different tables. The parseRowsIntoJavaObjects method maps flat rows to hierarchical objects like in any ORM based on how the query was executed, the fields requested and returned. — Tarmo, Jan 14 '20 at 09:41
Your post is not clear. You seem to have some wrong assumptions that you are not expressing. Please edit your question to clearly say what you are trying to do. PS FKs are not relationships & they do not define relationships. Tables (base & result) represent relation(ship)s/associations. FKs just tell the DBMS that when certain values participate in a certain relationship they participate in a certain other once. Constraints including FKs, PKs, UNIQUE, cardinalities etc just tell the DBMS about invalid states. Not needed for querying. A query reconstructing objects is not determined by them. — philipxy, Jan 14 '20 at 09:46
[Join vs. sub-query](https://stackoverflow.com/q/2577174/3404097) etc etc. If you are just asking whether you should use A join B vs A where a.x in (...B ...) or where exists (...B...a.x...a.y....) then please don't mention objects & why you are joining & what constraints hold (FKs, cardinalities, etc), just ask about query syntax variants & google re that because it is a faq. Good luck. — philipxy, Jan 14 '20 at 10:08
What is the relevance of your comment on "eager fetching"? (Usually it is better to fetch what you need, not the entire dataset.) — Rick James, Jan 15 '20 at 17:28
Well, as far as I know many ORM-s that load child objects of the parent, do so lazily - when the subobject is requested a separate query is executed and the data is loaded and inserted into the parent (i guess it is configurable). I am mapping child objects into parent object like in an ORM but I want to load all child objects immediately, not when they are specifically requested. In a sense, I always do need all the data. The user has already defined what they need before I execute any of the queries. It is a generic GraphQL implementation of sorts — Tarmo, Jan 16 '20 at 08:33

Gordon Linoff · Accepted Answer · 2020-01-15T02:01:32.070

This is too long for a comment.

But, you should pull the data in the format that you need it in your application. If you want three different objects in Java for As, Bs, and Cs, then pull the data three times. If you want a single object that combines attributes from the tables then use a single query with JOIN.

Your description says that you want separate objects, so use separate queries. You really need to in SQL, because one query can only return one set of columns.

There are downsides to the first approach in this case. One is the size of each row -- each row contains columns for both Bs and Cs. This occupies space even if the values are NULL.

A bigger downside is the multiplication of results. If there are 10 matches in B and 20 matches in C for a given row in A, then your query will return 200 rows for that value. You will then have to go through a lot of de-duplication on the application side.

There is a small downside to running multiple queries -- overhead for running and compiling the queries and returning multiple data sets. I would not be concerned about that because three separate queries are a better fit for what you want to accomplish.

Thank you for taking the time to reply. In case of JOINs the idea was indeed to deduplicate the data into separate objects on Java side. The very thing I was worried about was the increased data size due to multiplication of rows that you mentioned. I will use multiple queries then. — Tarmo, Jan 14 '20 at 12:40

JOIN vs SELECT IN when querying arbitrary related tables

1 Answers1