23

I am running a query like:

SELECT f.*, p.countryName, p.airportName, a.name AS agentName
FROM (
    SELECT 
        f.outboundlegid, 
        f.inboundlegid,
        f.querydatetime,
        cast(f.agent as bigint) as agent,
        cast(f.querydestinationplace as bigint) as querydestinationplace,
        f.queryoutbounddate,
        f.queryinbounddate,
        f.quoteageinminutes,
        f.price
    FROM flights f
    WHERE querydatetime >= '2018-01-02'
    AND querydatetime <= '2019-01-10'
) f
INNER JOIN (
  SELECT airportId, airportName, countryName
  FROM airports
  WHERE countryName IN ('Philippines', 'Indonesia', 'Malaysia', 'Hong Kong', 'Thailand', 'Vietnam')
) p
ON f.querydestinationplace = p.airportId
INNER JOIN agents a
ON f.agent = a.id
ORDER BY f.outboundlegid, f.inboundlegid, f.agent, querydatetime DESC

What's wrong with it? Or how can I optimize it? It gives me

Query exhausted resources at this scale factor

I have a flights table and I want to query for flights inside a specific country

Piotr Findeisen
  • 19,480
  • 2
  • 52
  • 82
Jiew Meng
  • 84,767
  • 185
  • 495
  • 805
  • How many rows are in each table? What is the format of the source files? Does it work if you remove the `ORDER BY`? I wonder if it would help to use normal JOINS without subqueries -- that is, `SELECT xx FROM flights JOIN airports ON xx JOIN agents on xx WHERE xx`. See also: [Top 10 Performance Tuning Tips for Amazon Athena | AWS Big Data Blog](https://aws.amazon.com/blogs/big-data/top-10-performance-tuning-tips-for-amazon-athena/) – John Rotenstein Jan 26 '19 at 07:00
  • 1
    Rows ... hmm flights 2,462,275,635. agents 99, places 434. Everything is parquet. Hmm trying without order by ... I tried joins b4, now you mention it, it works. Just that I thought maybe using a subquery will limit the table size faster leading to a hopefully faster join. – Jiew Meng Jan 27 '19 at 08:26
  • @JohnRotenstein, it works without ORDER BY ... `15 minutes 53 seconds, Data scanned: 2.71 GB` surprisingly, it does not look like too huge a data set to be processed? – Jiew Meng Jan 27 '19 at 08:40
  • If you want faster (or sorting), you could load the data into Redshift. – John Rotenstein Jan 27 '19 at 08:43
  • I am asked to use Athena because RedShift was too expensive haha! – Jiew Meng Jan 27 '19 at 08:46
  • Athena charges by volume scanned. If repeated queries are run over the same data, it could rival Redshift in terms of cost. Redshift would give better performance, but Athena gives serverless flexibility. It's for you to choose. – John Rotenstein Jan 27 '19 at 10:03
  • This might not be a problem if you use the Athena Preview - https://aws.amazon.com/athena/faqs follow instructions in "How do I test the preview features?". Not sure when the preview will replace the current Athena, but you can try it right now. – Steve Potter Feb 19 '20 at 17:40

1 Answers1

43

I have been facing this problem since the begining of Athena, the problem is the ORDER BY clause. Athena is just an EMR cluster with hive and prestodb installed. The problem you are facing is: Even if your query is distributed across X numbers of nodes, the ordering phase must be done by just a single node, the master node in this case. So at the end, you can order as much data as memory have the master node.

You can test it by reducing the amount of data the query returns maybe reducing the time range.

Asclepius
  • 57,944
  • 17
  • 167
  • 143
Roberto
  • 757
  • 1
  • 12
  • 21
  • 3
    . . The ordering does not *have to be done by a single node*. That is just -- alas -- how lots of parallel databases implement the ordering. Parallel sorts in SQL have been around for decades. It frustrates me that more modern systems don't use them. – Gordon Linoff Jan 26 '19 at 13:32
  • 5
    Presto has distributed sort for some time already. Athena is based on a quite old Presto version though. – Piotr Findeisen Jan 26 '19 at 13:44
  • 3
    Well, I'm not saying this is nice, this is amazing or this is how it works in the rest of the world. I'm saying how Athena works. And I know it because due my experience with Athena, I have been told this is the problem with ordering. Up to you guys to believe or not.... – Roberto Jan 26 '19 at 23:00
  • 2
    It does indeed seem like order by is causing the issue ... when I remove order by, it runs: 15 minutes 53 seconds, Data scanned: 2.71 GB. I notice its only 2.71GB of data, why does athena fail to handle this relatively small amount of data? – Jiew Meng Jan 27 '19 at 08:44
  • It's more important the output size. How big is the output file name without ordering? – Roberto Jan 27 '19 at 09:15
  • @Roberto its 30GB – Jiew Meng Jan 27 '19 at 10:57
  • In my tests the maximum output file size ordered was 10GB. That might be different in your case. In order to solve this problem I followed two ways: split the query or order the file outside Athena. You can choose whatever is the best for you. Athena is an amazing service but not perfect. :). I hope this helps – Roberto Jan 27 '19 at 16:09
  • 2
    In my experience the problem usually gets worse with wide tables. The dataset can be pretty small, but if it has 30 columns or so sorting is very often not possible. – Theo Feb 03 '19 at 20:35