2

Below is the data in TestingTable1 sorted by date in descending order always

BUYER_ID  |   ITEM_ID       |  CREATED_TIME
----------+-----------------+----------------------
1345653      110909316904     2012-07-09 21:29:06
1345653      151851771618     2012-07-09 19:57:33
1345653      221065796761     2012-07-09 19:31:48
1345653      400307563710     2012-07-09 18:57:33

And if this is the below data in TestingTable2 sorted by date in descending order always

USER_ID  |   PRODUCT_ID    |    LAST_TIME
---------+----------------+-----------------------
1345653     110909316904      2012-07-09 22:29:06
1345653     150851771618      2012-07-09 19:57:33    

Each row in TestingTable1 should match with TestingTable2, if doesn't match or data is not there in TestingTable2, then I need to show them in the output as in TestingTable1 I have this data but corresponding to TestingTable2 I have this data(which will be wrong data), so that I can see what is mismatch and what data is missing.

I need to compare TestingTable2 with TestingTable1 on BUYER_ID and USER_ID. I need to see, if BUYER_ID and USER_ID gets matched then I need to compare ITEM_ID with PRODUCT_ID and CREATED_TIME with LAST_TIME and if there is a mismatch in TestingTable2 after comparing with TestingTable1 in either one of them or both of them, then I need to show the result.

So if you look at the above example- I have three scenarios basically

  1. Firstly- In TestingTable1, in the First row ITEM_ID is matching with PRODUCT_ID in the First row of TestingTable2 but CREATED_TIME is not matching with LAST_TIME for the first row in both the tables
  2. Secondly- In TestingTable1, in the Second row CREATED_TIME is matching with LAST_TIME in the second row of TestingTable2 but ITEM_ID is not matching with PRODUCT_ID for the second row in both the tables
  3. Thirdly- In TestingTable1, last two lines(rows) are not there at all in TestingTable2. This scenario is not covered in my query that I wrote. I want this scenario also in my query.

So these are the three cases that I need to cover while comparing TestingTable2 with TestingTable1 always. And TestingTable1 is the MAIN table through which comparisons need to be made always, so it means data in TestingTable1 is always accurate.

So I need to show the result like below considering the above example if not matching either one of them or data is not there in TestingTable2 at all- TestingTable1 data then next to it same TestingTable2 data, so that I can see what value was there in TestingTable1 as compared to TestingTable2

BUYER_ID   |   ITEM_ID       |    CREATED_TIME           |      USER_ID   |     PRODUCT_ID     |     LAST_TIME   
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653      110909316904       2012-07-09 21:29:06            1345653          110909316904      2012-07-09 22:29:06
1345653      151851771618       2012-07-09 19:57:33            1345653          150851771618      2012-07-09 19:57:33
1345653      221065796761       2012-07-09 19:31:48            NULL             NULL              NULL
1345653      400307563710       2012-07-09 18:57:33            NULL             NULL              NULL

Below is the query I wrote that covers only my two scenarios that I mentioned above and it works fine and I will get the output like above leaving last two rows from the Output. But I need to add third scenario also in this(below) query so that it gives output like above.

SELECT *
FROM   (SELECT T2.buyer_id,
               T2.item_id,
               T2.created_time AS created_time,
               subq.user_id,
               subq.product_id,
               subq.LAST_TIME
        FROM   TestingTable2 subq
               JOIN TestingTable1 T2
                 ON T2.buyer_id = subq.user_id
                    AND subq.LAST_TIME = ( T2.created_time )
        WHERE  ( subq.product_id <> T2.item_id )
        UNION ALL
        SELECT T2.buyer_id,
               T2.item_id AS item_id,
               T2.created_time,
               subq.user_id,
               subq.product_id AS product_id,
               subq.LAST_TIME
        FROM   TestingTable2 subq
               JOIN TestingTable1 T2
                 ON T2.buyer_id = subq.user_id
                    AND subq.product_id = T2.item_id
        WHERE  ( subq.LAST_TIME <> ( T2.created_time ) )) finalResult
ORDER  BY finalResult.BUYER_ID;

Any suggestion will be appreciated.

P.S. I have asked few questions related to JOIN in past few days, but that covers only my two scenarios, not the third scenario that I need in this query.

Update:- I cannot use NOT IN or NOT EXISTS syntax for SQL, as I am working with Hive and Hive doesn't support NOT IN or NOT EXISTS that so I need some other way to deal with this.

I need to use my query only to modify it to work for Third Scenario, as that way Hive will support SQL syntax.

Below is my SQL Fiddle, that fulfills my two scenario above but not the third scenario. Can anyone help me to modify my query to work for third scenario as well?

http://sqlfiddle.com/#!3/102dd/1/0.

Data in Table1 should be there in Table2, if it is not there, then I need to show the mismatch between data after comparing from Table1 and also it is possible, data from Table1 will not be there in Table2, and I want to show that too.

Updated Output for the fiddle http://sqlfiddle.com/#!3/102dd/3/0

BUYER_ID   |   ITEM_ID       |    CREATED_TIME           |      USER_ID   |     PRODUCT_ID     |     LAST_TIME   
-----------+-----------------+---------------------------+----------------+--------------------+-----------------------
1345653       151851771618      July, 09 2012 19:57:33      1345653            150851771618         July, 09 2012 19:57:33
1345653       221065796761      July, 09 2012 19:31:48      1345653            221165796761         July, 09 2012 19:31:48
1345653       110909316904      July, 09 2012 21:29:06      1345653            110909316904         July, 09 2012 22:29:06
1345653       400307563710      July, 09 2012 18:57:33      NULL               NULL                 NULL
1345653       310411560125      July, 09 2012 16:09:49      NULL               NULL                 NULL

UPDATED SQL QUERY THAT IS GIVING ME ERROR

I replaced TestingTable1 with this query-

(SELECT BUYER_ID, ITEM_ID, rank(BUYER_ID), CREATED_TIME
FROM (
    SELECT BUYER_ID, ITEM_ID, CREATED_TIME
    FROM testingtable1
    where to_date(from_unixtime(cast(UNIX_TIMESTAMP(CREATED_TIME) as int))) = '2012-07-09'
    DISTRIBUTE BY BUYER_ID
    SORT BY BUYER_ID, CREATED_TIME desc
) T1
WHERE rank(BUYER_ID) < 5)

And TestingTable2 with this query-

(SELECT USER_ID, PROD_AND_TS.PRODUCT_ID as PRODUCT_ID, PROD_AND_TS.TIMESTAMPS as TIMESTAMPS FROM testingtable2 lateral view explode(PURCHASED_ITEM) exploded_table as PROD_AND_TS where to_date(from_unixtime(cast(PROD_AND_TS.TIMESTAMPS as BIGINT))) = '2012-07-09')
user1166147
  • 1,570
  • 2
  • 15
  • 17
arsenal
  • 23,366
  • 85
  • 225
  • 331
  • Might the same `(user,item)` pairs appear more than once in either table? – eggyal Jul 13 '12 at 06:57
  • Both the tables will always be sorted in descending order of the created_time. And both the tables will contain data for same date only. (Just now the table is, it will be like this only in real time). Each row in Table1 should be there in Table2, so the possibility are all the three cases that I mentioned above. No they will not appear more than once. – arsenal Jul 13 '12 at 06:59
  • 1
    See your fiddle, I modified the data, and the output is wrong, [http://sqlfiddle.com/#!3/102dd/3/0](http://sqlfiddle.com/#!3/102dd/3/0), You haven't understood my question, data in Table1 should be there in Table2, if it is not there, then I need to show the mismatch between data after comparing from Table1 and also it is possible, data from Table1 will not be there in Table2, and I want to show that too. If you look at my output. – arsenal Jul 13 '12 at 07:10
  • Basically each row in Table1 should match with Table2, like first row in Table1 should match with first row in Table2, it might be possible that ITEM_ID and PRODUCT_ID will not match and TIME will match and also it is possible ITEM_ID and PRODUCT_ID will match but TIME will not match. So I need to show the output that Corresponding to Table1 I have this data, but the same data in Table2 is this(which is wrong data). try running my query from the question on the SQL fiddle that I gave it to you. You will get better picture what I am looking for. And my SQL query works for two cases in my quest. – arsenal Jul 13 '12 at 07:25
  • I know about that stuff, but for explaining I said like that, If you run my query, you will get the output for my two scenarios. So in the output, you can see, that Table1 has this data but same data in Table2 is this, (which is wrong data) and that way I am doing comparison. – arsenal Jul 13 '12 at 07:34
  • See my updated question for the output for the above fiddle. – arsenal Jul 13 '12 at 07:46
  • 1
    How about [this](http://sqlfiddle.com/#!3/102dd/9)? – eggyal Jul 13 '12 at 07:55
  • @eggyal - You should post this as an answer. – Lieven Keersmaekers Jul 13 '12 at 08:07
  • @eggyal, Thanks it works. But only in SQL, as I am working with Hive and HiveQL doesn't supports `OR` with JOIN as I always get error as `FAILED: Error in semantic analysis: line 12:7 OR not supported in Join currently`. So is there any other way to do this SQL and get the same output? As I am afraid this will not work in HiveQL. :( But really appreciated all your help. Any suggestion if there is any other way to do this SQL will be of great help to me. – arsenal Jul 13 '12 at 08:08
  • @rjchar - I have posted another way to do this but I have no idea if your RDBMS allows it. – Lieven Keersmaekers Jul 13 '12 at 08:43
  • In SQL it is working fine. I am currently trying in HiveQL. The only thing I need to replace in your query is, I need to replace `TestingTable1` with some query and `TestingTable2` with some query. So I think I can replace any Table with the query. So I did that, but the error that I am getting is- `FAILED: Error in semantic analysis: Ambiguous Column Reference _C2` after replacing the Table Name with the SQL query. I am going to edit my question with the full SQL after my changes that I am going to use in HiveQL. – arsenal Jul 13 '12 at 08:50

4 Answers4

2

If you can't use NOT IN or the clean solution posted by @eggyal because of DBMS restrictions, another option might be to completely duplicate your original union and use those results in a LEFT JOIN.

Applied to your sqlFiddle, following statement returns the required results.

SQL Statement

SELECT * 
FROM(
    SELECT *
    FROM TestingTable1 A
    JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.LAST_TIME = A.Created_TIME 
    WHERE B.PRODUCT_ID <> A.ITEM_ID
    UNION ALL
    SELECT * 
    FROM TestingTable1 A
    INNER JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.PRODUCT_ID = A.ITEM_ID  
    WHERE B.LAST_TIME <> A.Created_TIME      
 ) X
UNION ALL
SELECT A.*, NULL, NULL, NULL
FROM   TestingTable1 A
       LEFT OUTER JOIN (
            SELECT *
            FROM TestingTable1 A
            JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.LAST_TIME = A.Created_TIME 
            WHERE B.PRODUCT_ID <> A.ITEM_ID
            UNION ALL
            SELECT * 
            FROM TestingTable1 A
            INNER JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.PRODUCT_ID = A.ITEM_ID  
            WHERE B.LAST_TIME <> A.Created_TIME      
       ) X ON A.BUYER_ID = X.BUYER_ID AND A.ITEM_ID = X.ITEM_ID
WHERE  X.BUYER_ID IS NULL
Lieven Keersmaekers
  • 57,207
  • 13
  • 112
  • 146
  • Thanks Lieven for great answer, Your query is working fine in SQL, for working in HiveQL, I need to replace TestingTable1 and TestingTable2 with some SQL query, So I replace TestingTable1 with some SQL query and TestingTable2 with some SQL query, and then afterward when I tried running in Hive, `I got some ambiguous columns name error`. Can you take a look what wrong I am doing in that SQL query? It is happening bcoz of some column names – arsenal Jul 13 '12 at 08:58
  • I can't execute your actual statement *(and I doubt you are going to create a fiddle for it :)* but for errors like this, I would use a divide and conquer technique. I would start by removing the first half of the union and see if the statement runs. If it doesn't, remove another union and see if it runs. Repeat that until you get a runnable statement. The error would then come from the last removed section of the statement. – Lieven Keersmaekers Jul 13 '12 at 09:09
  • `X UNION ALL` above this query works and after this query works, but when I run this together, then it doesn't works. I have edited my question by putting, what I am replacing with both of the tables. May be from that you can help me out I guess. – arsenal Jul 13 '12 at 09:30
  • @rjchar - what if you replace `SELECT A.*, NULL, NULL, NULL` with `SELECT A.*, NULL AS User_ID, NULL AS Product_ID, NULL AS Laste_Time` – Lieven Keersmaekers Jul 13 '12 at 09:33
  • No, It complains some different error know..`FAILED: Error in semantic analysis: Schema of both sides of union should match: Column user_id is of type bigint on first table and type void on second table` – arsenal Jul 13 '12 at 09:37
  • @rjchar - I have no knowledge whatsoever of HiveQL and how idiot it may sound, I would try to tackle that error by casting the `NULL` to a `bigint`. – Lieven Keersmaekers Jul 13 '12 at 09:41
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/13837/discussion-between-rjchar-and-lieven) – arsenal Jul 13 '12 at 09:42
1

Try this piece of code, written in SQL. I tested this on SQL Fiddle already.

 SELECT 
 tt1.buyer_id,tt1.item_id,tt1.created_time,
 tt2.user_id,tt2.product_id,tt2.last_time
 FROM 
 testingtable1 tt1 LEFT OUTER JOIN
 testingtable2 tt2 ON
 tt1.buyer_id = tt2.user_id
 AND 
 tt1.item_id = tt2.product_id
 AND
 tt1.created_time = tt2.last_time
bryan.blackbee
  • 1,934
  • 4
  • 32
  • 46
  • This is also not working, this gives back the output that I don't need at all. :(. This is the SQL fiddle in which data is there, we need some sql query that gives the OUTPUT like I showed in my question. [http://sqlfiddle.com/#!3/fe795/1/0](http://sqlfiddle.com/#!3/fe795/1/0) – arsenal Jul 13 '12 at 06:39
  • This is my upto date SQL fiddle and my query is also there which fulfills only my two scenarions but not the third one. [http://sqlfiddle.com/#!3/102dd/1/0](http://sqlfiddle.com/#!3/102dd/1/0). – arsenal Jul 13 '12 at 06:56
  • See what output I am getting after putting your query [http://sqlfiddle.com/#!3/102dd/2/0](http://sqlfiddle.com/#!3/102dd/2/0). This is wrong, If you compare my output in the question with your output, you will see the difference. In this I fiddle I have added few more data. – arsenal Jul 13 '12 at 07:05
1

This answer is in response to your comment request to do so here: https://stackoverflow.com/a/11440651/1166147.
I have actually given you the answer to this in 2 of your other duplicates that creates the exact output you are display. It is the first query I wrote here: https://stackoverflow.com/a/11440539/1166147 and it is mentioned and explained twice here: https://stackoverflow.com/a/11398990/1166147

I modified your query back to mine and ran it in your SQLFiddle:

enter image description here

Your requirements have evolved though, and it looks like you could now just do a LEFT JOIN if TestingTable1 really is always accurate data.

SELECT *
FROM TestingTable1 A
LEFT JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND (B.LAST_TIME = A.Created_TIME  OR B.PRODUCT_ID = A.ITEM_ID)

EDIT FOR COMMENT

MOCK SCENARIOS

SCENARIO||       TABLE1           ||           TABLE2
----------------------------------------------------------
SCENARIO||  BUYER  ITEM      TIME ||  USER   PRODUCT  TIME
1       ||    1     A        09   ||   1     Z        09
2       ||    1     B        10   ||   NO RECORD IN TABLE 2
3       ||    1     C        10   ||   1     C        02
4       ||    1     D        12   ||   NO RECORD IN TABLE 2
5       ||    1     E        01   ||   1     E        01
6       ||   NO RECORD IN TABLE 1 ||   1     Y        05      

You are asking for a SQL solution - but it isn't really a SQL question. You rule out the valid SQL answers given to you because you can't use them in your environment, then reopen the same question again. This is a HQL/HIVE question. Yes, there are some similarities, but there are some differences.

It looks like HQL would support something like this, though I have no way to test and SQLFiddle isn't valid for testing this either. You will have to work out how to do ISNULL or COALESCE in HQL, as shown on the comments beside the select *. This would 'merge' the results and return the value from whichever wasn't null if there was a non null value there. I do believe HQL supports ISNULL:

select * --BUYER_ID, isNull(B.USER_ID,C.USER_ID)
from
(select BUYER_ID,ITEM_ID ,Created_TIME  from TestingTable1) a
left join
(SELECT USER_ID,PRODUCT_ID, last_time
  FROM TestingTable2 ) b on(a.BUYER_ID = b.user_id and B.last_time =   A.Created_TIME)
left join
(SELECT USER_ID,PRODUCT_ID, last_time
  FROM TestingTable2 ) c on(a.BUYER_ID = c.user_id and c.PRODUCT_ID = A.ITEM_ID)

Here is yet another TSQL answer that may give you some ideas. I'd never use this in TSQL but it would actually work.

SELECT A.BUYER_ID,A.ITEM_ID,CREATED_TIME,COALESCE(B.USER_ID,X.USER_ID,Y.USER_ID),
COALESCE(B.PRODUCT_ID,X.PRODUCT_ID,Y.PRODUCT_ID)
,COALESCE(B.last_time,X.last_time,Y.last_time)
FROM TestingTable1 A
LEFT JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.PRODUCT_ID = A.ITEM_ID
AND B.last_time = A.Created_TIME
LEFT JOIN( SELECT USER_ID,PRODUCT_ID, last_time
  FROM TestingTable1 A
  LEFT JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND B.last_time = A.Created_TIME 
  WHERE  ISNULL(B.PRODUCT_ID,0) <> A.ITEM_ID AND B.USER_ID IS NOT NULL) X ON
  X.USER_ID = A.BUYER_ID AND A.Created_TIME = X.last_time
LEFT JOIN( SELECT USER_ID,PRODUCT_ID, last_time
FROM TestingTable1 A
LEFT JOIN TestingTable2 B ON A.BUYER_ID = B.USER_ID AND ISNULL( B.PRODUCT_ID,0) =   A.ITEM_ID  
WHERE  B.last_time <> A.Created_TIME AND B.USER_ID IS NOT NULL) Y ON
A.BUYER_ID = Y.USER_ID AND A.ITEM_ID = Y.PRODUCT_ID
Community
  • 1
  • 1
user1166147
  • 1,570
  • 2
  • 15
  • 17
  • This works in SQL though.But it will not work in `HiveQL` as Hive doesn't supports OR in JOIN, so that is the reason I need some other way to tackle this. Any suggestions. – arsenal Jul 13 '12 at 23:00
  • Edited - but you need to stop reopening this - especially as a SQL question. It is HQL and while there are similarities, there are differences, and you are having people spend time on it and then saying their valid SQL answers don't work - when they do - for SQL. – user1166147 Jul 14 '12 at 04:40
0

It sounds like what you are looking for is a Full outer join. I didn't see in your post what type of database you are using so I can't post the exact syntax, but this link may point you in the right direction:

http://www.w3resource.com/sql/joins/perform-a-full-outer-join.php

Specifically look at the diagram at the bottom.

Hermit
  • 377
  • 1
  • 3
  • I am working with Hive and Hive supports SQL like syntax, so any sql syntax will work I guess, but I need to add that feature in my above query, as my above query is working fine for my above two case. – arsenal Jul 13 '12 at 05:12
  • If that's the case, I would try something like this: SELECT T2.buyer_id, T2.item_id, T2.created_time as created_time, subq.user_id, subq.product_id, subq.LAST_TIME FROM TestingTable2 subq FULL OUTER JOIN TestingTable1 T2 ON T2.buyer_id = subq.user_id AND T2.item_id = subq.product_id AND T2.CREATED_TIME = subq.LAST_TIME sorry I forgot to add the Full Outer Join – Hermit Jul 13 '12 at 05:22
  • I tried the above query, and the output that I am getting by using this query, I don't need that output. I am getting NULL in front of every row. I need the OUPTUT in my above format. – arsenal Jul 13 '12 at 05:43
  • Oh, I believe I misunderstood your question, I thought you wanted to see records in table2 if table1 had no matches as well as table 1 if table 2 had no matches and both if they do match. If you just want to see table1 at all times, and only show table 2 when there are matches (otherwise show null) like your table in the middle of the page, that's a left join, or right outer join (depending how you write it). – Hermit Jul 13 '12 at 05:51
  • Can you refactor my query by changing the JOIN as per your suggestion, then I will be able to understand more. – arsenal Jul 13 '12 at 05:52
  • Sure, give this a shot: SELECT T1.buyer_id, T1.item_id, T1.created_time as created_time, T2.user_id, T2.product_id, T2.LAST_TIME FROM TestingTable1 T1 LEFT JOIN TestingTable2 T2 ON T1.buyer_id = T2.user_id AND T1.item_id = T2.product_id AND T1.CREATED_TIME = T2.LAST_TIME Notice, I changed the table aliases to make more sense (to me). – Hermit Jul 13 '12 at 05:55
  • I am not sure whether this will work or not. As I want to show Table1 always and next to Table2 data which is there in corresponding Table1, If you see my Output in the question that specifically mentions, Table1 data and next to Table2 data(which is wrong data), but in your case I am getting only one column. – arsenal Jul 13 '12 at 06:02
  • Only one column? Which column are you getting? I suspect a copy/past issue via the comments. – Hermit Jul 13 '12 at 06:10
  • 1
    If you see this sql fiddle, [http://sqlfiddle.com/#!3/d8a67/3/0](http://sqlfiddle.com/#!3/d8a67/3/0) that I just added. I need same like this, but I need to add this thing in my SQL query as I need to show this using only single query. – arsenal Jul 13 '12 at 06:12
  • I just ran my query in this page and got the output you were looking for, 4 rows, first two had all 6 columns populated, last two had data in column 1,2,3, and nulls in 4,5,6 – Hermit Jul 13 '12 at 06:17
  • In your table there was some problem, I have modifed your table data, and then after running I am always getting null. [http://sqlfiddle.com/#!3/fe795/1/0](http://sqlfiddle.com/#!3/fe795/1/0) This is the fiddle and your query is there, but I need output as mentioned in my question. – arsenal Jul 13 '12 at 06:25
  • So, I think you are beginning to get tired. The data in the last example you linked to me is different. It is completely null in the last 3 columns because the data truly doesn't match. You changed the table2 product id from 151851771618 to 150851771618....you also appear to have changed the hour on the first row in the Table2 – Hermit Jul 13 '12 at 06:34
  • Yup, that was by mistake. I apologize for that, I noticed after you posted your fiddle. If you see my question, currently the data in the question is same as in the fiddle I gave it to you. Sorry about that. – arsenal Jul 13 '12 at 06:36
  • How what can be done? I thought we already solved what you needed. The query outputs just like you requested in your original post. – Hermit Jul 13 '12 at 06:43
  • ummm, I think it is getting confused more I guess. Let me explain you. This is my data in the SQL Fiddle [http://sqlfiddle.com/#!3/6f03a/1/0](http://sqlfiddle.com/#!3/6f03a/1/0). I wrote that query to fulfill my two scenarios, but it doesn't covers my third scenario. The SQL fiddle that you posted was wrong if you compare the data, so that is the reason your query will not work. Try running your SQL query in this data fiddle and see what output you are getting and compare it with my output. – arsenal Jul 13 '12 at 06:49
  • So if you look at this fiddle: http://sqlfiddle.com/#!3/d8a67/17 this shows how the data result matches your data example above. I have to admit, the wording of your third scenario is what is confusing me. Because as it is currently worded, you just want to include the 2 rows that exist in Table1, but not in Table2. This link shows that my query displays those two rows. – Hermit Jul 13 '12 at 06:57
  • Problem with your fiddle is you data is not right. And data will never be like this. Both the tables will always be sorted in descending order of the created_time. And both the tables will contain data for same date only. (Just now the table is, it will be like this only in real time). Each row in Table1 should be there in Table2, so the possibility are all the three cases that I mentioned above. As you can see data in Table1 first row is accurate, and same data should be there in Table2 first row also, but if you look first row in table2, data will not match, likewise for second row also. – arsenal Jul 13 '12 at 07:03