
Given three tables as follows:

Table `item_attributes`:

| item_id | store | color | brand  |
|---------|-------|-------|--------|
| 123     | ABC   | red   | kings  |
| 456     | ABC   | blue  | kings  |
| 111     | XYZ   | green | queens |

Table `item_cost`:

| item_id | store | currency | price |
|---------|-------|----------|-------|
| 123     | ABC   | usd      | 2.34  |
| 111     | XYZ   | usd      | 9.21  |
| 122     | ABC   | usd      | 6.31  |

Table `item_names` (main table):

| item_id | store | name                 |
|---------|-------|----------------------|
| 123     | ABC   | Short sleeve t-shirt |
| 111     | XYZ   | Plaid skirt          |

The goal is to use AWS Athena to combine selected columns from item_cost (price) and item_attributes (brand) into the main item_names table, with item_id as the main join key.

I've tried this and it kinda works:

SELECT

att.item_id,
att.store,
att.brand,

cost.item_id,
cost.store,
cost.price,

main.item_id,
main.store,
main.name

FROM item_names main
LEFT OUTER JOIN item_attributes att ON main.item_id = att.item_id
LEFT OUTER JOIN item_cost cost ON main.item_id = cost.item_id
WHERE main.store = cost.store AND main.store = att.store

But my questions (in parts) are:

  • Is there a simpler/cleaner query to perform the join without increasing the complexity (Big-O)?
  • Would it be computationally cheaper to do a nested join of main JOIN att, then JOIN cost? Or would it be the same as the chained LEFT OUTER JOINs in the code above?
  • Would it be computationally cheaper to do main UNION att UNION cost and then GROUP BY item_id?
  • What you need is a FULL OUTER JOIN in all cases, as it seems that there are rows with missing data on all sides; even when you do it only for attributes and cost, because item_names must have at least one corresponding item_id – nbk Feb 24 '23 at 22:27
  • You can use [Using EXPLAIN and EXPLAIN ANALYZE - Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html) to determine efficiency of the queries. How big are these tables to make you worry about query efficiency? Also, what is the storage format of the underlying data? It might be more 'efficient' to change the storage format to Snappy-compressed Parquet than to change the query. – John Rotenstein Feb 24 '23 at 22:31
  • The number of rows are in 100 millions to billions for all 3 tables. They are parquet. – alvas Feb 24 '23 at 22:34
  • LEFT JOIN returns INNER JOIN rows UNION ALL unmatched left table rows extended by NULLs. Always know what INNER JOIN you want as part of an OUTER JOIN. After a LEFT JOIN a WHERE, INNER JOIN or HAVING that requires a right [sic] table column to be not NULL removes any rows with introduced NULLs, ie leaves only INNER JOIN rows, ie "turns OUTER JOIN into INNER JOIN". You have that. Suggesting you might not understand left join and/or be getting what you want. – philipxy Feb 25 '23 at 01:59
  • [How much research effort is expected of Stack Overflow users?](https://meta.stackoverflow.com/q/261592/3404097) [mre] [Tips for asking a good SQL question](https://meta.stackoverflow.com/q/271055/3404097) [Asking query performance questions](https://dba.meta.stackexchange.com/q/3034/43932) [How to Optimize Queries in a Database - The Basics](https://stackoverflow.com/q/3191623/3404097) [What are your most common sql optimizations?](https://stackoverflow.com/q/1332778/3404097) [When and why are database joins expensive?](https://stackoverflow.com/q/173726/3404097) PS Please ask 1 question. – philipxy Feb 25 '23 at 02:03
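philipxy's point in the comments (a `WHERE` on a right-table column turns a LEFT JOIN back into an INNER JOIN) can be reproduced in miniature. This sketch uses Python's built-in `sqlite3` in place of Athena's engine (the join semantics involved are the same), with the question's `item_names` and `item_cost` rows plus one hypothetical extra item (789, added here purely to create an unmatched row):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE item_names (item_id INT, store TEXT, name TEXT);
CREATE TABLE item_cost  (item_id INT, store TEXT, currency TEXT, price REAL);
INSERT INTO item_names VALUES (123,'ABC','Short sleeve t-shirt'),
                              (111,'XYZ','Plaid skirt'),
                              (789,'ABC','Denim jacket');  -- hypothetical row with no cost entry
INSERT INTO item_cost VALUES (123,'ABC','usd',2.34),
                             (111,'XYZ','usd',9.21),
                             (122,'ABC','usd',6.31);
""")

# Pure LEFT JOIN: unmatched main-table rows survive, extended with NULLs.
left = cur.execute("""
    SELECT main.item_id, cost.price
    FROM item_names main
    LEFT JOIN item_cost cost ON main.item_id = cost.item_id
""").fetchall()

# LEFT JOIN plus a WHERE condition on a right-table column: the NULL-extended
# rows fail the comparison and are filtered out, i.e. an INNER JOIN in effect.
filtered = cur.execute("""
    SELECT main.item_id, cost.price
    FROM item_names main
    LEFT JOIN item_cost cost ON main.item_id = cost.item_id
    WHERE cost.store = main.store
""").fetchall()

print(len(left))      # 3 rows: item 789 kept, with price NULL
print(len(filtered))  # 2 rows: item 789 dropped
```

If the intent is to keep unmatched rows, the `store` conditions belong in the `ON` clauses, not in `WHERE`.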

1 Answer

A query joining multiple tables with millions of rows is going to be resource-intensive, but I assume you are doing this to create a 'master' table with joined data.

You can use [Using EXPLAIN and EXPLAIN ANALYZE in Athena - Amazon Athena](https://docs.aws.amazon.com/athena/latest/ug/athena-explain-statement.html) to test the 'efficiency' of the queries. I would also recommend using Snappy-compressed storage files, which are faster to retrieve from disk (and also cheaper, since Amazon Athena charges based on the amount of data read from disk).

Your joining query might be as simple as:

SELECT ...
FROM item_names
JOIN item_attributes USING (item_id, store)
JOIN item_cost USING (item_id, store)

However, if there are missing values you might need to use LEFT OUTER JOIN.
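As a sketch of the suggestion above, again using Python's `sqlite3` (which also supports `USING`) with the question's sample rows, and `LEFT JOIN` in place of plain `JOIN` to cover the missing-values case:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE item_names      (item_id INT, store TEXT, name TEXT);
CREATE TABLE item_attributes (item_id INT, store TEXT, color TEXT, brand TEXT);
CREATE TABLE item_cost       (item_id INT, store TEXT, currency TEXT, price REAL);
INSERT INTO item_names      VALUES (123,'ABC','Short sleeve t-shirt'),
                                   (111,'XYZ','Plaid skirt');
INSERT INTO item_attributes VALUES (123,'ABC','red','kings'),
                                   (456,'ABC','blue','kings'),
                                   (111,'XYZ','green','queens');
INSERT INTO item_cost       VALUES (123,'ABC','usd',2.34),
                                   (111,'XYZ','usd',9.21),
                                   (122,'ABC','usd',6.31);
""")

# USING (item_id, store) joins on both columns at once and emits each
# joined column only once, so no table prefix is needed in the SELECT list.
rows = cur.execute("""
    SELECT item_id, store, name, brand, price
    FROM item_names
    LEFT JOIN item_attributes USING (item_id, store)
    LEFT JOIN item_cost       USING (item_id, store)
    ORDER BY item_id
""").fetchall()

for r in rows:
    print(r)
# (111, 'XYZ', 'Plaid skirt', 'queens', 9.21)
# (123, 'ABC', 'Short sleeve t-shirt', 'kings', 2.34)
```

Rows present only in the lookup tables (items 456 and 122) do not appear, since `item_names` drives the join; rows of `item_names` with no match would appear with NULL brand or price.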

John Rotenstein
  • I'm still working out how to effectively process my data using `EXPLAIN ANALYZE`, will get back if I find anything interesting given the data sizes. – alvas Feb 25 '23 at 12:35
  • The nice thing about `USING` is that the 'joined' column only appears once in the output, so you don't have to dot-specify which table it comes from. It only works when both tables have the same name for the 'joined' column(s). – John Rotenstein Feb 26 '23 at 03:38