How to create drop_duplicates in a SQL query?

Question

A common operation in pandas is dropping rows that are duplicated on a subset of columns, for example:

In [14]: import io
    ...: import pandas as pd

In [15]: csv='''\
    ...: a,b
    ...: 1,2
    ...: 1,3
    ...: 2,3
    ...: 3,1
    ...: 3,3'''

In [16]: dt = pd.read_csv(io.StringIO(csv))

In [17]: dt
Out[17]:
   a  b
0  1  2
1  1  3
2  2  3
3  3  1
4  3  3

In [18]: dt.drop_duplicates(subset = ['a'])
Out[18]:
   a  b
0  1  2
2  2  3
3  3  1

How can this be done in SQL? Is there a standard function or approach that does what drop_duplicates(subset=<list>) does?

Edit

How the pandas duplicated function works:

In [20]: dt['a'].duplicated()
Out[20]:
0    False
1     True
2    False
3    False
4     True
Name: a, dtype: bool

In [21]: dt.drop_duplicates(subset=['a'])
Out[21]:
   a  b
0  1  2
2  2  3
3  3  1

@GordonLinoff typically not really - I've added in an example of how pandas select things though — baxx, Aug 09 '20 at 16:59

Gordon Linoff · Answer 1 · 2020-08-09T17:09:16.563


I think you want:

select a, b
from (select t.*, row_number() over (partition by a order by b) as seqnum
      from t
     ) t
where seqnum = 1;

Note that SQL tables represent unordered sets, unlike dataframes. There is no "first" row unless a column specifies the ordering.
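To make the ordering point concrete, here is a minimal sketch that runs the query above against an in-memory SQLite database from Python. It assumes a SQLite build with window-function support (3.25 or later) and adds a hypothetical idx column, not part of the original table, that stores the DataFrame position so that "first" is well defined. On the sample data, ordering by idx and ordering by b happen to pick the same rows.

```python
import sqlite3

# Sample rows from the question as (idx, a, b).
# idx is a hypothetical column holding the original DataFrame position;
# it is an assumption added here so SQL has something to order by.
rows = [(0, 1, 2), (1, 1, 3), (2, 2, 3), (3, 3, 1), (4, 3, 3)]

conn = sqlite3.connect(":memory:")
conn.execute("create table t (idx integer, a integer, b integer)")
conn.executemany("insert into t values (?, ?, ?)", rows)

# Same shape as the query above, but "first" is defined by idx
# instead of b, which mimics drop_duplicates(subset=['a'], keep='first').
query = """
select a, b
from (select t.*, row_number() over (partition by a order by idx) as seqnum
      from t
     ) t
where seqnum = 1
order by idx
"""
first_per_a = conn.execute(query).fetchall()
print(first_per_a)  # [(1, 2), (2, 3), (3, 1)]
```

Changing "order by idx" to "order by idx desc" inside the window would keep the last row of each group instead, similar to keep='last' in pandas.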

If you don't care which row supplies the value of b, you can also use aggregation:

select a, min(b) as b
from t
group by a;
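The aggregation variant can be checked the same way. This is a rough sketch using Python's sqlite3 module with the sample data from the question; the outer "order by a" is an addition here so the printed result is deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table t (a integer, b integer)")
conn.executemany(
    "insert into t values (?, ?)",
    [(1, 2), (1, 3), (2, 3), (3, 1), (3, 3)],
)

# One row per value of a, keeping the smallest b in each group.
min_per_a = conn.execute(
    "select a, min(b) as b from t group by a order by a"
).fetchall()
print(min_per_a)  # [(1, 2), (2, 3), (3, 1)]
```

With more than one non-key column, each aggregate is computed independently, so the values in a result row may come from different source rows. The row_number() approach keeps whole rows intact.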

edited Aug 09 '20 at 17:09

answered Aug 09 '20 at 16:57

Gordon Linoff

