0

So I have two tables (let's say x and y). Most of the data from both table are duplicated but there are some rows that are different. I insert whole data from those two tables into a new table (let's say table_mixed). There's one column that indicate table's date e.g. 20190307 for x and 20190308 for y So, for any duplicated rows, there will be a date column that is different.

num        Code     col1 col2 col3.....  import_date    file_date   
-------- ---------  -----------------   ----------   ----------
01         AA       ......                20190308          20190307      
01         AA       ......                20190308          20190308      
02         AA       ......                20190308          20190307      
03         BB       ......                20190308          20190308      

What I am trying to do is, I want to query a data such that, show a non-duplicated row from both table and for any duplicated row, shows only a row with recent date.

I have done some finding and I have tried this:

select *,max(file_date) over (partition by stx_import_date) max_date 
from table_mixed;

where file_date is a date that tell the different date from each table and every row from both table has the same import_date.

num        Code     col1 col2 col3......  import_date    file_date     max_date 
-------- ---------         ------------     ----------  ----------
01         AA       ......                20190308        20190307     20190308
01         AA       ......                20190308        20190308     20190308
02         AA       ......                20190308        20190307     20190307
03         BB       ......                20190308        20190308     20190308

The result from this query show every row (including all duplicated row) and add another column (max_date) that show only the recent file_date for each of the row. But I want the result to show only what I mentioned above and no additional column (max_date).

This is the result that I am looking for:

num        Code     col1 col2 col3...  import_date  file_date   
-------- ---------    ------------    ----------   --------   
01         AA       ......            20190308        20190308      
02         AA       ......            20190308        20190307      
03         BB       ......            20190308        20190308      

Thank you

P.S Not only column num, code and import date that need to be duplicated but also other columns that I ..... So, what I mean duplicated row >> every column except file_date (which I have 10+ columns)

P.S.2 I edited the example so that you guy want get me wrong. There're other columns (like col1, col2, col3 and so on) that also used. How should I use partitioned by in this case

emp
  • 602
  • 3
  • 11
  • 22

3 Answers3

0

Use row_number window function.

  • Do partition on num,code,import_date..etc columns.
  • Order by on file_date desc

Sample query:

Select * from (
select *,row_number() over (partition by num,code,stx_import_date order by file_date desc) row_number
from table_mixed)t
where t.row_number = 1;
notNull
  • 30,258
  • 4
  • 35
  • 50
  • I also found something related to this row_number function. The actual data that I am working with is not just num,code, stx_import_date but every other column. So, do I have to partition by every column(like 15-20 columns) – emp Mar 10 '19 at 13:16
  • like are there any efficient way than typing almost all of the column to be partitioned? – emp Mar 10 '19 at 13:42
  • @Jamiewp, Possible other way would be using Hive variables and define variable value as your column names `num,code,stx_import_date..etc` and reuse them, https://stackoverflow.com/questions/12464636/how-to-set-variables-in-hive-scripts, apart from this i don't think there another way instead we need to mention all column names in `partition by` clause. – notNull Mar 10 '19 at 14:02
0

You seem to care about the num column and only want the most recent data. You can do this with your mixed table as:

select tm.*
from (select tm.*,
             row_number() over (partition by num, code, . . . order by file_date desc) as seqnum
      from table_mixed
     ) tm
where seqnum = 1;

Note: If the file dates are the same, then an arbitrary row will be chosen.

This may be more efficient to do when you create the mixed table. You can just do:

select y.*
from y
union all
select x.*
from x left join
     y
     on x.num = y.num and
        x.code = y.code and
        . . .
where y.num is null;

This returns all rows from y (the more recent) along with any non-matching rows from x (and assumes none of the column values are NULL).

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • The change in num column is just for the example. Actually the non-duplicated row I mentioned can be any of the column that different. – emp Mar 10 '19 at 13:12
  • Thank you for helping. The point is that I have too many columns to be used to check for duplicate. Are there any more efficient way other than typing all column except the file_data? – emp Mar 10 '19 at 13:46
  • @Jamiewp . . . No, there is not in Hive unless you start using wildcards for columns. And 15 columns or so is not very much to type. You've typed a lot more in asking the question here. – Gordon Linoff Mar 10 '19 at 13:47
  • Sorry, this might be more than I asked but what I aim to do is that, I have many files which each file have its own schema so, one might have id, code, num another might have name, address, etc. So, if there is a query that no need to specify any column name separately, it would be good(I will need to integrate with Python script later on) – emp Mar 10 '19 at 13:54
  • @Jamiewp . . . That is not the question that you asked. Changing this question would invalidate the answers of people who have tried to help you. So, I would recommend that you ask another question and be as clear as you can be regarding the data your have and what you want to do. – Gordon Linoff Mar 10 '19 at 14:16
0

Maybe grouping by the MAX() of whatever date field will be the differentiator?

;WITH get_max_dt AS (
    SELECT TM.[num]
    ,   TM.[Code]
    ,   TM.[import_date]
    ,   MAX(TM.[file_date]) AS [file_date]
    FROM table_mixed AS TM
    GROUP BY TM.[num],TM.[Code],TM.[import_date]

)
SELECT *
FROM get_max_dt

Output:

num  Code  import_date  file_date
01   AA    20190308     20190308
02   AA    20190308     20190307
03   BB    20190308     20190308
Mark Moretto
  • 2,344
  • 2
  • 15
  • 21
  • I edited the question. The differentiator is every other 10+ columns except the file_date column – emp Mar 10 '19 at 13:37
  • @Jamiewp . . .And ??? Aggregation should work for what you want (as well as `row_number()` in the other answers). – Gordon Linoff Mar 10 '19 at 13:41