
How can I convert the following query to be compatible with Spark 1.6, which does not support subqueries:

SELECT ne.device_id, sp.device_hostname
FROM `table1` ne INNER JOIN `table2` sp 
ON sp.device_hostname = 
                      (SELECT device_hostname FROM `table2` 
                      WHERE device_hostname LIKE 
                      CONCAT(ne.device_id,'%') ORDER BY device_hostname DESC LIMIT 1)

I have read that it supports subqueries in the FROM clause but not in the WHERE clause, but the following also doesn't work:

SELECT * FROM (SELECT ne.device_id, sp.device_hostname
FROM `table1` ne INNER JOIN `table2` sp 
ON sp.device_hostname = 
                      (SELECT device_hostname FROM `table2` 
                      WHERE device_hostname LIKE 
                      CONCAT(ne.device_id,'%') ORDER BY device_hostname DESC LIMIT 1)) AS TA

My overall goal is to join the two tables but only keep the last record from table2. The SQL statements are valid, but when I run them through HiveContext.sql in Spark I get an AnalysisException.

user6666914

1 Answer


You can use a HiveContext with window functions (see How to select the first row of each group?):

scala> Seq((1L, "foo")).toDF("id", "device_id").registerTempTable("table1")

scala> Seq((1L, "foobar"), (2L, "foobaz")).toDF("id", "device_hostname").registerTempTable("table2")

scala> sqlContext.sql("""
     |   WITH tmp AS (
     |     SELECT ne.device_id, sp.device_hostname, row_number() OVER (PARTITION BY device_id ORDER BY device_hostname) AS rn
     |     FROM table1 ne INNER JOIN table2 sp 
     |     ON sp.device_hostname LIKE CONCAT(ne.device_id, '%'))
     |   SELECT device_id, device_hostname FROM tmp WHERE rn = 1
     | """).show
+---------+---------------+                                                     
|device_id|device_hostname|
+---------+---------------+
|      foo|         foobar|
+---------+---------------+
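If you prefer the DataFrame API, the windowed query above can be expressed roughly like this. This is only a sketch: df1 and df2 are placeholders for the DataFrames registered as table1 and table2, it assumes you are in the spark-shell (so the sqlContext implicits for $"..." are available), and window functions in 1.6 still require a HiveContext:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.row_number

// Per device_id, keep the first matching hostname in ascending order
val w = Window.partitionBy("device_id").orderBy("device_hostname")

val result = df1.as("ne")
  .join(df2.as("sp"), $"sp.device_hostname".startsWith($"ne.device_id"))  // same as LIKE CONCAT(device_id, '%')
  .select($"ne.device_id", $"sp.device_hostname")
  .withColumn("rn", row_number().over(w))
  .where($"rn" === 1)
  .drop("rn")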

but with only two columns you can aggregate:

scala> sqlContext.sql("""
     |  WITH tmp AS (
     |    SELECT ne.device_id, sp.device_hostname
     |    FROM table1 ne INNER JOIN table2 sp 
     |    ON sp.device_hostname LIKE CONCAT(ne.device_id, '%'))
     |  SELECT device_id, min(device_hostname) AS device_hostname
     |  FROM tmp GROUP BY device_id 
     |""").show
+---------+---------------+                                                     
|device_id|device_hostname|
+---------+---------------+
|      foo|         foobar|
+---------+---------------+
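The aggregation variant looks similar in the DataFrame API (same hypothetical df1 / df2 as above):

import org.apache.spark.sql.functions.min

val resultAgg = df1.as("ne")
  .join(df2.as("sp"), $"sp.device_hostname".startsWith($"ne.device_id"))
  .groupBy($"ne.device_id")
  .agg(min($"sp.device_hostname").alias("device_hostname"))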

To improve performance you should try to replace the LIKE with an equality condition (see How can we JOIN two Spark SQL dataframes using a SQL-esque "LIKE" criterion?).
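For illustration only: if the hostnames happened to follow a fixed pattern such as <device_id>-<suffix> (an assumption, not something stated in the question), you could derive the join key up front and use an equi-join, which Spark can execute as a hash or sort-merge join instead of a filtered Cartesian product:

import org.apache.spark.sql.functions.{min, split}

// Hypothetical: everything before the first "-" is the device_id
val sp = df2.withColumn("device_key", split($"device_hostname", "-").getItem(0))

val resultEq = df1.join(sp, df1("device_id") === sp("device_key"))
  .groupBy("device_id")
  .agg(min("device_hostname").alias("device_hostname"))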

Alper t. Turker
  • Both methods above hang for over an hour without completing, so I suspect they don't work. The following command is near-instant: `sql_context.sql("Select * from table2,table1 where device_id=device_hostname")`. I expect some latency, but over an hour suggests it doesn't work. – user6666914 Jan 23 '18 at 18:13
  • Do you suggest any other way to tackle the problem? – user6666914 Jan 23 '18 at 18:17
  • There are some hints here https://stackoverflow.com/q/33168970/8371915 - to summarize, try to do things without LIKE. – Alper t. Turker Jan 23 '18 at 19:42