1

A common operation in pandas is something such as

In [14]: import io

In [15]: csv='''\
    ...: a,b
    ...: 1,2
    ...: 1,3
    ...: 2,3
    ...: 3,1
    ...: 3,3'''

In [16]: dt = pd.read_csv(io.StringIO(csv))

In [17]: dt
Out[17]:
   a  b
0  1  2
1  1  3
2  2  3
3  3  1
4  3  3

In [18]: dt.drop_duplicates(subset = ['a'])
Out[18]:
   a  b
0  1  2
2  2  3
3  3  1

How can this be performed in SQL though? Is there either a standard function or approach to doing what drop_duplicates(subset = <list>) does?

Edit

How pandas duplicate function works:

In [20]: dt['a'].duplicated()
Out[20]:
0    False
1     True
2    False
3    False
4     True
Name: a, dtype: bool

In [21]: dt.drop_duplicates(subset=['a'])
Out[21]:
   a  b
0  1  2
2  2  3
3  3  1
BENY
  • 317,841
  • 20
  • 164
  • 234
baxx
  • 3,956
  • 6
  • 37
  • 75

1 Answers1

1

I think you want:

select a, b
from (select t.*, row_number() over (partition by a order by b) as seqnum
      from t
     ) t
where seqnum = 1;

Note that SQL tables represent unordered sets, unlike dataframes. There is no "first" row unless a column specifies the ordering.

If you don't care about the rows, you can also use aggregation:

select a, min(b) as b
from t
group by a;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786