Remove duplicate from a file or table

Question

I have data in a database table and I'm exporting this to a file like this and there are around 100k records (this is kind of duplicate based on id)

            id     |  dp_1   | pp_1  | Phone  |
            -------|---------|-------|--------|
            1      | dp1     |       | phone1 |
            ----------------------------------|
            1      |         | pp1   | phone1 |
            ----------------------------------|
            2      |  dp2    | pp2   | phone2 |
            ------------------------------------
            2      |         |       | phone4 |
            -----------------------------------
            3      |  dp3    | pp3   | phone3 |
            ------------------------------------
            3      |  dp3    |       | phone3 |
            -----------------------------------
            4      |         | pp4   |        |
            ------------------------------------
            4      |  dp4    |       |        |

I wanted the result to be as below:

            id     |  dp_1   | pp_1  | Phone           |
            -------|---------|-------|-----------------|
            1      | dp1     |  pp1  | phone1 - phone1 |
            -------------------------------------------|
            2      | dp2     | pp2   | phone2 - phone4 |
            -------------------------------------------|
            3      | dp3     | pp3   | phone3          |
            -------------------------------------------|

            4      |   dp4   | pp4   |                 |
            --------------------------------------------

I have written this SQL:

WITH cte AS (
  SELECT*, 
 row_number() OVER(PARTITION BY id,DP_1, PP_1, phone ORDER BY id desc) 
 AS [rn]
 FROM table1
   )
Select * into #temp from cte WHERE [rn] = 1 ORDER BY id

How can I achieve this in Python OR using SQL query? I'm using Anaconda.

Why does id 3 only have `phone3`, but id 1 has `phone1 - phone1`? Why is the logic different for these 2? — Thom A, Dec 22 '17 at 10:30
Refereed here: https://stackoverflow.com/questions/194852/concatenate-many-rows-into-a-single-text-string Need to still modify few more things. — AskMe, Dec 22 '17 at 14:00

Thom A · Answer 1 · 2017-12-22T14:15:02.923

1

I don't understand why id 1 and 3 have different logic for phone (one duplicates the number, one does not). This answer can either duplicate the phone (like in id 1) or return the DISTINCT values (like id 3). You can change the logic by uncommenting the GROUP BY.

--Sample Data
WITH VTE AS (
    SELECT *
    FROM (VALUES (1,'dp1',NULL,'phone1'),
                 (1,NULL,'pp1','phone1'),
                 (2,'dp2','pp2','phone2'),
                 (2,NULL,NULL,'phone4'),
                 (3,'dp3','pp2','phone3'),
                 (3,'dp3',NULL,'phone3')) V(id, dp_1, pp_1, phone))
--And the answer
SELECT id,
       MAX(dp_1) AS dp_1,
       MAX(pp_1) AS pp_1,
       STUFF((SELECT ' - ' + sq.phone 
              FROM VTE sq
              WHERE sq.id = VTE.id
                AND phone <> ''
              --GROUP BY sq.phone --If you only want to display unique phones, uncomment the GROUP BY.
              FOR XML PATH('')),1,3,'') AS [phone]
FROM VTE
GROUP BY id;

edited Dec 22 '17 at 14:15

answered Dec 22 '17 at 10:41

Thom A

88,727
11
45
75

Looking like problem is here when phone number is blank. Just modified the question for blank phone number scenario. I have so may columns like this (just collected few as an example) – AskMe Dec 22 '17 at 12:56
@AskMe blank or NULL? Also, what is the "problem"? – Thom A Dec 22 '17 at 13:38
@Askme I have modified the query for if the value of `phone` is blank (this is **not** the same as *`NULL`*; as you specified blank I have only catered for that. Should you require logic for *`NULL`* I suggest you add a `IS NOT NULL` and (probably) an `ISNULL` around the entire subquery ). No further logic changes in regards to ID 1 and 3, as I am yet to have a response on this. Thanks. – Thom A Dec 22 '17 at 14:17

Sreenu131 · Answer 2 · 2017-12-22T10:49:15.030

This Query gives your expected Result

;With cte( id,dp_1,pp_1,Phone)
AS
(            
 SELECT 1 ,  'dp1' , NULL   , 'phone1'   UNION ALL
 SELECT 1 ,   NULL , 'pp1'  , 'phone1'   UNION ALL
 SELECT 2 ,  'dp2' , 'pp2'  , 'phone2'   UNION ALL
 SELECT 2 ,   NULL ,  NULL  , 'phone4'   UNION ALL
 SELECT 3 ,  'dp3' , 'pp3'  , 'phone3'   UNION ALL
 SELECT 3 ,  'dp3' ,  NULL  , 'phone3'   
 )
 SELECT 
     DISTINCT id  , 
     MAX(dp_1)OVER(PARTITION BY id ORDER BY id) AS dp_1 ,
     MAX(pp_1)OVER(PARTITION BY id ORDER BY id) AS pp_1,
 STUFF((SELECT DISTINCT  ' - ' + Phone  FROM cte i WHERE i.id=o.id
FOR XML PATH ('')),1,2,'') AS Phone
FROM cte o

Result

id  dp_1     pp_1    Phone
--------------------------------
1   dp1      pp1     phone1 
2   dp2      pp2     phone2 - phone4
3   dp3      pp3     phone3

score 0 · Answer 3 · answered Dec 22 '17 at 10:43

In Python the best solution in your case is pandas. I also use numpy to select unique variables for the "phones" in your case

First, I just create your table (reading from SQL is a separate question I guess)

df = pd.DataFrame(data={'id': [1, 1, 2, 2, 3, 3],
                        'dp_1': ['dp1', np.nan, 'dp2', np.nan, 'dp3', 'dp3'],
                        'pp_1': [np.nan, 'pp1', 'pp2', np.nan, 'pp3', np.nan],
                        'Phone': ['phone1 ', 'phone1 ', 'phone2 ', 'phone4 ', 'phone2 ', 'phone3 ']})

Then I create a function which will be applied in grouping

def unique_sum(str_list):
    return np.sum(np.unique(str_list))

and then apply groupby. I hope that's what you need

df.groupby('id').aggregate({'dp_1': 'last', 'pp_1': 'last', 'Phone': unique_sum})


    pp_1           Phone dp_1
id                          
1   pp1         phone1   dp1
2   pp2  phone2 phone4   dp2
3   pp3  phone2 phone3   dp3

Remove duplicate from a file or table

3 Answers3