2

I have written a query to find 10 most busy airports in the USA from March to April. It produces the desired output however I want to try to further optimize it.

Are there any HiveQL specific optimizations that can be applied to the query? Is GROUPING SETS applicable here? I'm new to Hive and for now this is the shortest query that I've come up with.

SELECT airports.airport, COUNT(Flights.FlightsNum) AS Total_Flights
FROM (
SELECT Origin AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
UNION ALL
SELECT Dest AS Airport, FlightsNum 
  FROM flights_stats
  WHERE (Cancelled = 0 AND Month IN (3,4))
) Flights
INNER JOIN airports ON (Flights.Airport = airports.iata AND airports.country = 'USA')
GROUP BY airports.airport
ORDER BY Total_Flights DESC
LIMIT 10;

The table columns are as following:

Airports

|iata|airport|city|state|country|

Flights_stats

|originAirport|destAirport|FlightsNum|Cancelled|Month|
leftjoin
  • 36,950
  • 8
  • 57
  • 116
samba
  • 2,821
  • 6
  • 30
  • 85

4 Answers4

3

It might help if you do the aggregation before the union all:

SELECT a.airport, SUM(cnt) AS Total_Flights
FROM ((SELECT Origin AS Airport, COUNT(*) as cnt 
       FROM flights_stats
       WHERE (Cancelled = 0 AND Month IN (3,4))
       GROUP BY Origin
      ) UNION ALL
      (SELECT Dest AS Airport, COUNT(*) as cnt
       FROM flights_stats
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY Dest
      )
     ) f INNER JOIN
     airports a
     ON f.Airport = a.iata AND a.country = 'USA'
GROUP BY a.airport
ORDER BY Total_Flights DESC
LIMIT 10;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
3

Filter by airport(inner join) and do aggregation before UNION ALL to reduce dataset passed to the final aggregation reducer. UNION ALL subqueries with joins should run in parallel and faster than join with bigger dataset after UNION ALL.

SELECT f.airport, SUM(cnt) AS Total_Flights
FROM (
      SELECT a.airport, COUNT(*) as cnt 
       FROM flights_stats f
            INNER JOIN airports a ON f.Origin=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
       UNION ALL
      SELECT a.airport, COUNT(*) as cnt
       FROM flights_stats f
            INNER JOIN airports a ON f.Dest=a.iata AND a.country='USA'
       WHERE Cancelled = 0 AND Month IN (3,4)
       GROUP BY a.airport
     ) f 
GROUP BY f.airport
ORDER BY Total_Flights DESC
LIMIT 10
;

Tune mapjoins and enable parallel execution:

set hive.exec.parallel=true;
set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

Use Tez and vectorizing, tune mappers and reducers parallelism: https://stackoverflow.com/a/48487306/2700344

leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • I like it. Do all the work in the inline view. Then your outer query just presents the results in the desired way. – hrobertv Mar 07 '18 at 15:08
  • I would probably move the a.country='USA' down to the WHERE predicate instead of a join condition. But I don't think it will matter. Just a preference maybe. – hrobertv Mar 07 '18 at 15:09
  • I'd make the alias of flight stats fs instead of just f, especially since your inline view alias is f. – hrobertv Mar 07 '18 at 15:24
  • @hrobertv could you also share your opinion on the aspect that I've mentioned in my question - if `GROUPING SETS` is applicable here? I've read Hive documentation about it but still can't figure out how to apply it in my query. – samba Mar 07 '18 at 15:38
2

I don't think GROUPING SETS are applicable here because you are only grouping by one field.

From Apache Wiki: "The GROUPING SETS clause in GROUP BY allows us to specify more than one GROUP BY option in the same record set."

hrobertv
  • 158
  • 1
  • 8
0

You can test this but you are in the case where an Union maybe better, so You really need to test it and come back :

SELECT airports.airport,
SUM(
  CASE 
     WHEN T1.FlightsNum IS NOT NULL THEN 1
     WHEN T2.FlightsNum IS NOT NULL THEN 1
     ELSE 0
  END 
  ) AS Total_Flights
FROM airports
LEFT JOIN (SELECT  Origin AS Airport, FlightsNum 
    FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))) t1 
 on t1.Airport = airports.iata
LEFT JOIN (SELECT Dest AS Airport, FlightsNum 
   FROM flights_stats
   WHERE (Cancelled = 0 AND Month IN (3,4))) t2
 on t1.Airport = airports.iata
GROUP BY airports.airport
ORDER BY Total_Flights DESC
Daniel E.
  • 2,440
  • 1
  • 14
  • 24
  • 1
    Your query may produce wrong results because LEFT JOIN can multiply records. For example 10 records with Origin='NY' and 10 records with Dest='NY' will give you 100 records in total after 2 joins, right? Also need to add `limit 10` and filter airport by `a.country = 'USA'` – leftjoin Mar 07 '18 at 14:53