
Given three tables as follows:

Table `item_attributes`:

| item_id | store | color | brand  |
|---------|-------|-------|--------|
| 123     | ABC   | red   | kings  |
| 456     | ABC   | blue  | kings  |
| 111     | XYZ   | green | queens |

Table `item_cost`:

| item_id | store | currency | price |
|---------|-------|----------|-------|
| 123     | ABC   | usd      | 2.34  |
| 111     | XYZ   | usd      | 9.21  |
| 122     | ABC   | usd      | 6.31  |

Table `item_names` (main table):

| item_id | store | name                 |
|---------|-------|----------------------|
| 123     | ABC   | Short sleeve t-shirt |
| 111     | XYZ   | Plaid skirt          |

The goal is to use AWS Athena to combine selected columns from item_cost (price) and item_attributes (brand) into the main item_names table, with item_id as the main join key.

I've tried this and it kinda works:

SELECT

att.item_id,
att.store,
att.brand,

cost.item_id,
cost.store,
cost.price,

main.item_id,
main.store,
main.name

FROM item_names main
LEFT OUTER JOIN item_attributes att ON main.item_id = att.item_id
LEFT OUTER JOIN item_cost cost ON main.item_id = cost.item_id
WHERE main.store = cost.store AND main.store = att.store

But my questions (in parts) are:

  • Is there a simpler/cleaner query to perform the join without increasing the complexity (Big-O)?
  • Would it be computationally cheaper to do a nested join of main JOIN att, then JOIN cost? Or would it be the same as the chained LEFT OUTER JOINs in the code above?
  • Would it be computationally cheaper to do main UNION att UNION cost and then GROUP BY item_id?
  • What you need is a FULL OUTER JOIN in all cases, as it seems that there are rows with missing data on all sides; even when you do it only for attributes and cost, because item_names must have at least one corresponding item_id – nbk Feb 24 '23 at 22:27
  • You can use [Using EXPLAIN and EXPLAIN ANALYZE - Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html) to determine efficiency of the queries. How big are these tables to make you worry about query efficiency? Also, what is the storage format of the underlying data? It might be more 'efficient' to change the storage format to Snappy-compressed Parquet than to change the query. – John Rotenstein Feb 24 '23 at 22:31
  • The number of rows are in 100 millions to billions for all 3 tables. They are parquet. – alvas Feb 24 '23 at 22:34
  • LEFT JOIN returns INNER JOIN rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN you want as part of an OUTER JOIN. After a LEFT JOIN a WHERE, INNER JOIN or HAVING that requires a right [sic] table column to be not NULL removes any rows with introduced NULLs, ie leaves only INNER JOIN rows, ie "turns OUTER JOIN into INNER JOIN". You have that. Suggesting you might not understand left join and/or be getting what you want. – philipxy Feb 25 '23 at 01:59
  • [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [mre] [Tips for asking a good SQL question](https://meta.stackoverflow.com/q/271055/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) PS Please ask 1 question. – philipxy Feb 25 '23 at 02:03
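philipxy's point in the comments (a `WHERE` on a right-table column turns a LEFT JOIN back into an INNER JOIN) can be reproduced in miniature. This sketch uses Python's built-in `sqlite3` in place of Athena's engine (the join semantics involved are the same), with the question's `item_names` and `item_cost` rows plus one hypothetical extra item (789, added here purely to create an unmatched row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE item_names (item_id INT, store TEXT, name TEXT);
CREATE TABLE item_cost  (item_id INT, store TEXT, currency TEXT, price REAL);
INSERT INTO item_names VALUES (123,'ABC','Short sleeve t-shirt'),
                              (111,'XYZ','Plaid skirt'),
                              (789,'ABC','Denim jacket');  -- hypothetical row with no cost entry
INSERT INTO item_cost VALUES (123,'ABC','usd',2.34),
                             (111,'XYZ','usd',9.21),
                             (122,'ABC','usd',6.31);
""")

# Pure LEFT JOIN: unmatched main-table rows survive, extended with NULLs.
left = cur.execute("""
    SELECT main.item_id, cost.price
    FROM item_names main
    LEFT JOIN item_cost cost ON main.item_id = cost.item_id
""").fetchall()

# LEFT JOIN plus a WHERE condition on a right-table column: the NULL-extended
# rows fail the comparison and are filtered out, i.e. an INNER JOIN in effect.
filtered = cur.execute("""
    SELECT main.item_id, cost.price
    FROM item_names main
    LEFT JOIN item_cost cost ON main.item_id = cost.item_id
    WHERE cost.store = main.store
""").fetchall()

print(len(left))      # 3 rows: item 789 kept, with price NULL
print(len(filtered))  # 2 rows: item 789 dropped
```

If the intent is to keep unmatched rows, the `store` conditions belong in the `ON` clauses, not in `WHERE`.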

1 Answer

A query joining multiple tables with millions of rows is going to be resource-intensive, but I assume you are doing this to create a 'master' table with joined data.

You can use [Using EXPLAIN and EXPLAIN ANALYZE in Athena - Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html) to test the 'efficiency' of the queries. I would also recommend using Snappy-compressed storage files, which are faster to retrieve from disk (and also cheaper, since Amazon Athena charges based on the amount of data read from disk).

Your joining query might be as simple as:

SELECT ...
FROM item_names
JOIN item_attributes USING (item_id, store)
JOIN item_cost USING (item_id, store)

However, if there are missing values you might need to use LEFT OUTER JOIN.
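As a sketch of the suggestion above, again using Python's `sqlite3` (which also supports `USING`) with the question's sample rows, and `LEFT JOIN` in place of plain `JOIN` to cover the missing-values case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE item_names      (item_id INT, store TEXT, name TEXT);
CREATE TABLE item_attributes (item_id INT, store TEXT, color TEXT, brand TEXT);
CREATE TABLE item_cost       (item_id INT, store TEXT, currency TEXT, price REAL);
INSERT INTO item_names      VALUES (123,'ABC','Short sleeve t-shirt'),
                                   (111,'XYZ','Plaid skirt');
INSERT INTO item_attributes VALUES (123,'ABC','red','kings'),
                                   (456,'ABC','blue','kings'),
                                   (111,'XYZ','green','queens');
INSERT INTO item_cost       VALUES (123,'ABC','usd',2.34),
                                   (111,'XYZ','usd',9.21),
                                   (122,'ABC','usd',6.31);
""")

# USING (item_id, store) joins on both columns at once and emits each
# joined column only once, so no table prefix is needed in the SELECT list.
rows = cur.execute("""
    SELECT item_id, store, name, brand, price
    FROM item_names
    LEFT JOIN item_attributes USING (item_id, store)
    LEFT JOIN item_cost       USING (item_id, store)
    ORDER BY item_id
""").fetchall()

for r in rows:
    print(r)
# (111, 'XYZ', 'Plaid skirt', 'queens', 9.21)
# (123, 'ABC', 'Short sleeve t-shirt', 'kings', 2.34)
```

Rows present only in the lookup tables (items 456 and 122) do not appear, since `item_names` drives the join; rows of `item_names` with no match would appear with NULL brand or price.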

John Rotenstein
  • I'm still working out how to effectively process my data using `EXPLAIN ANALYZE`, will get back if I find anything interesting given the data sizes. – alvas Feb 25 '23 at 12:35
  • The nice thing about `USING` is that the 'joined' column only appears once in the output, so you don't have to dot-specify which table it comes from. It only works when both tables have the same name for the 'joined' column(s). – John Rotenstein Feb 26 '23 at 03:38