
I'm looking for a way to do the equivalent to the SQL

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison docs don't mention anything about DISTINCT.

.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

smci
Jody
  • You'd have to do something like `df.apply(pd.Series.unique)`, but this won't work if the number of unique values varies across the columns, so you'd have to construct a dict with the column names as keys and the unique values as the values – EdChum May 29 '15 at 13:25
  • [SO Documentation](http://stackoverflow.com/documentation/pandas/1751/indexing-and-selecting-data/26077/select-distinct-rows-across-dataframe) – user2314737 Dec 03 '16 at 12:15

8 Answers


You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
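For instance, with the same frame as above, passing `subset` deduplicates on column `a` alone, keeping the first row seen for each value:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Rows are considered duplicates based on column 'a' only;
# the first occurrence of each value of 'a' is kept.
distinct_a = df.drop_duplicates(subset=['a'])
print(distinct_a)
#    a  b
# 0  1  3
# 1  2  4
```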

joris
  • Possibly worth noting that `df.drop_duplicates()` is not an inplace method by default, so it returns a new DataFrame (leaving `df` unchanged). This is fairly standard behaviour, but may still be a useful point to note. – evophage Feb 04 '19 at 21:29

I've tried different solutions. The first was:

a_df = np.unique(df[['col1', 'col2']], axis=0)

which works well for non-object data. Another way to do this, which avoids the error for object-dtype columns, is to apply drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but in my case it was very slow:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)
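As a sketch of the dtype issue mentioned above (hypothetical data): np.unique with axis=0 can raise a TypeError on object-dtype columns, while drop_duplicates handles any dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'y', 'x'], 'col2': ['a', 'b', 'a']})

# np.unique(..., axis=0) is not supported for object dtype
try:
    np.unique(df[['col1', 'col2']], axis=0)
except TypeError as exc:
    print('np.unique failed:', exc)

# drop_duplicates works regardless of dtype
a_df = df.drop_duplicates(['col1', 'col2'])
print(a_df)
```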
Yury Wallet

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
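Using the toy frame from the accepted answer, a quick sketch showing that the groupby count agrees with drop_duplicates:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': [3, 4, 3, 5]})

# COUNT(DISTINCT col1, col2): the number of groups equals
# the number of distinct (col1, col2) pairs
n_groupby = len(df.groupby(['col1', 'col2']))
n_drop = len(df.drop_duplicates(['col1', 'col2']))

print(n_groupby, n_drop)  # both 3
```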

ncoghlan

I think drop_duplicates will sometimes not be so useful, depending on the DataFrame.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it worked for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

Naiara Andrade

There is no unique method for a DataFrame. If the number of unique values for each column were the same, then the following would work: df.apply(pd.Series.unique). But if not, you will get an error. Another approach would be to store the values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
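The loop above can also be written as a dict comprehension:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 2, 4], 'b': [1, 1, 1, 2, 2]})

# One array of unique values per column, keyed by column name
d = {col: df[col].unique() for col in df}
print(d)
```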
EdChum

I think the following is the cleanest approach:

df.filter(items=['Column A', 'Column B']).drop_duplicates()
Bryun

You can take the sets of the columns and just subtract the smaller set from the larger set:

distinct_values = set(df['a'])-set(df['b'])
dorante

I stumbled onto this question looking for the same solution but using Apache Spark for .NET (C#).

The following worked for me (given a list of CSV files in a folder):

string filePath = "file:///Users/me/dups/*";

var opts = new Dictionary<string, string>();
opts.Add("header", "true");
opts.Add("quote", "\"");
opts.Add("multiLine", "true");
opts.Add("sep", ",");

// load data
DataFrame df1 = spark.Read()
   .Options(opts)
   .Csv(filePath);

// Columns to make distinct on. In my case there was one column I knew
// was distinct per row, so I listed all columns except that one.
Column[] columns = {
    Col("MyColumn1"),
    Col("MyColumn2"),
    // etc.
};

var distinctCount = df1
    .Select(columns)
    .Distinct()
    .Count();

Console.WriteLine(distinctCount);
Chris Smith