1

These are two tables below-

CREATE EXTERNAL TABLE IF NOT EXISTS Table1   (This is the MAIN table through which     comparisons need to be made) 
( 
ITEM_ID BIGINT, 
CREATED_TIME STRING, 
BUYER_ID BIGINT 
) 

CREATE EXTERNAL TABLE IF NOT EXISTS Table2 
( 
USER_ID BIGINT, 
PURCHASED_ITEM ARRAY<STRUCT<PRODUCT_ID: BIGINT,TIMESTAMPS:STRING>> 
) 

As BUYER_ID and USER_ID they both are same thing.

I need to find the total COUNT and all those BUYER_ID that are not there in Table2 by comparing from Table1. So I think it's a kind of Left Outer Join Query. I am new to HiveSql stuff so I am having problem to figure out what should be the actual syntax to do this in HiveQL. I wrote the below SQL Query. Can anyone tell me whether the SQL query below is fine or not to achieve my scenario?

SELECT COUNT(BUYER_ID), BUYER_ID 
FROM Table1 dw 
LEFT OUTER JOIN Table2 dps ON (dw.BUYER_ID = dps.USER_ID) 
GROUP BY BUYER_ID;
arsenal
  • 23,366
  • 85
  • 225
  • 331

1 Answers1

2

If I understand your requirements correctly, I think you are almost there. It seems you only need to add a condition checking if there's no match between the two tables:

SELECT COUNT(BUYER_ID), BUYER_ID 
FROM Table1 dw 
LEFT OUTER JOIN Table2 dps ON (dw.BUYER_ID = dps.USER_ID) 
WHERE dps.USER_ID IS NULL
GROUP BY BUYER_ID;

The above will filter out BUYER_IDs that do have matches in Table2, and will show the remaining BUYER_IDs and their corresponding count values. (Well, that's what I understand you want.)

arsenal
  • 23,366
  • 85
  • 225
  • 331
Andriy M
  • 76,112
  • 17
  • 94
  • 154
  • Thanks Andriy, But I was not able to understand what does this line mean? `WHERE pds.USER_ID IS NULL` – arsenal Jul 04 '12 at 16:09
  • 1
    The left join attempts to match rows of the two tables and returns *all rows* from the left side (which is `Table1`) and *only matching rows* from the right side of the join (`Table2`). If there's no match, the right-side columns are filled with NULLs. This is an intermediate result, before the `WHERE` condition is applied. Now, that condition looks at one of the right side's columns, the one that can't possibly contain NULLs *if there are matches*, and checks if it is NULL. If it *is* NULL (meaning there's no match), the corresponding non-matching row (and thus `BUYER_ID`) is returned. – Andriy M Jul 04 '12 at 17:03
  • Then GROUP BY is applied to the rows remained after the WHERE condition. As a result, you are getting BUYER_IDs that have no match in `Table2`, and the corresponding COUNT values. – Andriy M Jul 04 '12 at 17:04
  • Thanks Andriy for detailed description. Really Appreciated. I am stuck on one more question related to HiveQL. This is the question. [http://stackoverflow.com/questions/11336950/joining-two-tables-in-hive-using-hiveql](http://stackoverflow.com/questions/11336950/joining-two-tables-in-hive-using-hiveql). Any help will be appreciated. – arsenal Jul 05 '12 at 03:18
  • Sorry, that is definitely not my area of expertise. I only answered this question because it was tagged `sql` and your query looked SQL to me. But I'd known nothing about HiveQL until after I read about it some 15 minutes ago, so you can imagine that I've got absolutely no experience working with HiveQL's ARRAYs or STRUCTs. Sorry again, I would gladly help, but I just can't. – Andriy M Jul 05 '12 at 05:08
  • Really Appreciated all your help. – arsenal Jul 05 '12 at 07:12
  • @Nija, Sorry for bothering you like this, Can you please take a look into this problem?[link] (http://stackoverflow.com/questions/11336950/joining-two-tables-in-hive-using-hi‌​veql). It will be of great help to me. – arsenal Jul 05 '12 at 07:12