1

I need to get all the record from tableA greater than cdc_date which is stored in another tmp_table

tmp_table has only one column cdc_date and only one record.

tableA has more the 5 million records.

My Hive query

Select count(*) from tableA as a 
where unix_timestamp((concat_ws('-',a.year,a.month,a.day,a.hour)),"yyyy-MM-dd-HH") > 
(select b.cdc_date from tmp_table as b)

I am receiving below error

Unsupported SubQuery Expression 'cdc_date': Only SubQuery expressions that are top level conjuncts are allowed

Can anyone suggest how to active this.

Koushik Roy
  • 6,868
  • 2
  • 12
  • 33

2 Answers2

0

You need to rewrite the sql -

Select count(*) from tableA as a 
Left join tmp_table b
On unix_timestamp((concat_ws('-',a.year,a.month,a.day,a.hour)),"yyyy-MM-dd-HH") > 
b.cdc_date
Koushik Roy
  • 6,868
  • 2
  • 12
  • 33
  • In my environment hive.mapred.mode is set to strict, so cartesion products are disabled. I tried using left join, still I am receiving the same error "Cartesian products are disabled for safety reasons. – Kalpana Srinivasababu May 17 '21 at 11:38
  • its not really Cartesian join - its just left join. YOu can change it to normal join and see ? – Koushik Roy May 17 '21 at 12:12
  • I tried normal join as well. Still it says "Cartesian products are disabled for safety reasons". is it because we are trying to do non-equi join? – Kalpana Srinivasababu May 17 '21 at 12:24
  • strange. does any normal joins like `select a.id from a join b on a.id = b.id` works? – Koushik Roy May 17 '21 at 13:43
  • No, I want records greater than cdc_date. I am trying with clause now. With t1 as (select cdc_date from tmp_table) Select count(*) from tableA as a where unix_timestamp((concat_ws('-',a.year,a.month,a.day,a.hour)),"yyyy-MM-dd-HH") > t1; but this query throws "Invalid table alias or column reference 't1'" error – Kalpana Srinivasababu May 17 '21 at 15:02
  • i think your sql is wrong it should be ` With t1 as (select cdc_date from tmp_table) Select count(*) from tableA , t1 as a where unix_timestamp((concat_ws('-',a.year,a.month,a.day,a.hour)),"yyyy-MM-dd-HH") > t1.cde_date;` – Koushik Roy May 17 '21 at 15:09
  • 1
    @KoushikRoy Non-equi joins in Hive are implemented as CROSS join + filter, this is why your query also does not work. Because under the hood it is cross join. And query in your last comment:`... from tableA , t1 as a where ... ` - this is also the same CROSS join because join without ON is a cross – leftjoin May 17 '21 at 15:48
  • Some reading about joins https://stackoverflow.com/a/46843832/2700344 And this about theta joins https://stackoverflow.com/a/65788740/2700344 Also execute EXPLAIN for query with non-equi-join and you will see, it is CROSS-join in the plan. Cross-joins are inevitable, you can not restrict them and be happy. there are many cases when you really need cross join. Your query with single-row table is such example when you really need cross-join – leftjoin May 17 '21 at 20:44
  • I understand. The last SQL was a modification of OP's sql. I was correcting some syntax. – Koushik Roy May 18 '21 at 03:10
0

Cross join with single row table:

Select count(*) 
 from tableA as a 
      cross join tmp_table b
where unix_timestamp((concat_ws('-',a.year,a.month,a.day,a.hour)),"yyyy-MM-dd-HH") > b.cdc_date
leftjoin
  • 36,950
  • 8
  • 57
  • 116
  • In my environment hive.mapred.mode is set to strict, so cartesion products are disabled. – Kalpana Srinivasababu May 17 '21 at 11:41
  • @KalpanaSrinivasababu Without cross join enabled the only way is to use parameter with cdc_date fetched in another session, it creates unnecessary complications. Better set hive.mapred.mode=nonstrict; – leftjoin May 17 '21 at 11:51