1

enter image description here

I have this data where I have to remove the duplicate data highlighted in yellow. I tried to use distinct and group by but it does not seem to work.

Even though the comment and value is different, anything having the value as CRIMSSTANDARD is considered a child of the parent in this case and hence considered as a duplicate as the first 6 fields are the same. How do I remove the duplicate records in this case?

CLIENT;CLIENT_ID;QTY_TYPE;QUANTITY;AMOUNT;TRANTYPE;COMMENT;VALUE;ORDER_ID;DEV_COMM;AMT_COMM;ID
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-AGNIndPlusFd;Q;2898;;Buy;INS;CRIMSSTANDARD;10902701;;;
TEST_IMPL8;P-AGNIndPlusFd;A;;140000;Sell;INS;CRIMSSTANDARD;10902601;;;
TEST_IMPL8;P-DepVz;A;;82000;Sell;GEN,TransferOrder;10902568;Y;0;
TEST_IMPL8;P-DepVz;A;;82000;Sell;INS;CRIMSSTANDARD;10902568;;;
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;GEN;OHPEN_14613296BD001571;10902668;;;14613296BD001571
TEST_IMPL8;P-DivBnd2Vv;A;;4862.09;Buy;INS;CRIMSSTANDARD;10902668;;;
TEST_IMPL8;P-Dyn4Vz;A;;13000;Buy;INS;CRIMSSTANDARD;10877286;;;
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;GEN;OHPEN_14613296BD001565;10677375;;;14613296BD001565
TEST_IMPL8;P-EmuCoInsBVv;Q;524.6892;;Sell;INS;CRIMSSTANDARD;10677375;;;
TEST_IMPL8;P-VNAMijBel;Q;10236;;Sell;INS;CRIMSSTANDARD;11402183;;;
halfer
  • 19,824
  • 17
  • 99
  • 186
drv236
  • 31
  • 1
  • 4

2 Answers2

1

There is no distinct on in Snowflake, but you can have a similar result using qualify:

SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value 
  ORDER BY client, client_id, qty_type, quantity, amount, trantype, value
) = 1;

See here to see more details

EDIT:

To keep the rows that have the dev_comm, amt_comm and ID non null in priority you can modify your order condition:

SELECT * FROM my_table
QUALIFY ROW_NUMBER() OVER (
  PARTITION BY client, client_id, qty_type, quantity, amount, trantype, value 
  ORDER BY 
    IFF(dev_comm IS NOT NULL, 0, 1), 
    IFF(amt_comm IS NOT NULL, 0, 1), 
    IFF(ID IS NOT NULL, 0, 1)
) = 1;

You can also have different priorities by ordering the fields in the order by. For example in case you have multiple lines that have the dev_comm columns but you want in priority the line that have the ID, you need to put the iff(ID ... first in your order by. It order by the first element, then if there is several lines with the same first element it order by the second, etc...

robin loche
  • 276
  • 2
  • 7
  • Hi, I tried this and I get 8 records but I am getting the records included that I wanted to remove and the unique records with the dev_comm,amt_comm and ID fields are getting deleted instead. Is there a workaround to this? – drv236 May 11 '22 at 20:50
  • 1
    Please post sample data in text, so someone trying to answer can use and test same easily. – Pankaj May 11 '22 at 23:12
  • You can use the one shared by @robin, replace '=' with '>' and also add condition as per your question before QUALIFY clause - where value = 'CRIMSSTANDARD' – Pankaj May 11 '22 at 23:25
  • @Pankaj added sample data in text – drv236 May 12 '22 at 01:12
  • @drv236 I edited my answer to order to have you columns not null in priority – robin loche May 13 '22 at 16:02
1

First pass, you'll need to get rid of rows that are entire duplicates of each other (this will fix the row 1 and row 3 issue in your data where the rows are exactly the same). Use the approach outlined here: How to delete Duplicate records in snowflake database table

Next, try this to eliminate duplicates of first 6 columns, and deleting the child record designated by CRIMSSTANDARD in the COMMENT column:

begin work;
delete from
    my_table using (
        select
            CLIENT,
            CLIENT_ID,
            QTY_TYPE,
            QUANTITY,
            AMOUNT,
            TRANTYPE
        from
            my_table
        GROUP BY
            CLIENT,
            CLIENT_ID,
            QTY_TYPE,
            QUANTITY,
            AMOUNT,
            TRANTYPE
        HAVING
            COUNT(*) > 1
    ) as my_table_dupes
where
    my_table.CLIENT = X.CLIENT
    AND my_table.CLIENT_ID = my_table_dupes.CLIENT_ID
    AND my_table.QTY_TYPE = my_table_dupes.QTY_TYP
    AND my_table.QUANTITY = my_table_dupes.QUANTITY
    AND my_table.AMOUNT = my_table_dupes.AMOUNT
    AND my_table.TRANTYPE = my_table_dupes.TRANTYPE
    AND my_table.COMMENT = 'CRIMSSTANDARD' 
commit work;
Jim Demitriou
  • 593
  • 4
  • 8