
I'm looking for a way to do the equivalent to the SQL

SELECT DISTINCT col1, col2 FROM dataframe_table

The pandas SQL comparison docs don't mention anything about DISTINCT.

.unique() only works for a single column, so I suppose I could concat the columns, or put them in a list/tuple and compare that way, but this seems like something pandas should do in a more native way.

Am I missing something obvious, or is there no way to do this?

smci
Jody
  • You'd have to do something like `df.apply(pd.Series.unique)`, but this won't work if the number of unique values varies across the columns, so you'd have to construct a dict with the column names as keys and the unique values as the values – EdChum May 29 '15 at 13:25
  • [SO Documentation](http://stackoverflow.com/documentation/pandas/1751/indexing-and-selecting-data/26077/select-distinct-rows-across-dataframe) – user2314737 Dec 03 '16 at 12:15

8 Answers


You can use the drop_duplicates method to get the unique rows in a DataFrame:

In [29]: df = pd.DataFrame({'a':[1,2,1,2], 'b':[3,4,3,5]})

In [30]: df
Out[30]:
   a  b
0  1  3
1  2  4
2  1  3
3  2  5

In [32]: df.drop_duplicates()
Out[32]:
   a  b
0  1  3
1  2  4
3  2  5

You can also provide the subset keyword argument if you only want to use certain columns to determine uniqueness. See the docstring.
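For instance, with the same frame as above, passing `subset` deduplicates on column `a` alone, keeping the first row seen for each value:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2, 1, 2], 'b': [3, 4, 3, 5]})

# Rows are considered duplicates based on column 'a' only;
# the first occurrence of each value of 'a' is kept.
distinct_a = df.drop_duplicates(subset=['a'])
print(distinct_a)
#    a  b
# 0  1  3
# 1  2  4
```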

joris
  • Possibly worth noting that `df.drop_duplicates()` is not an inplace method by default, so it returns a new DataFrame (leaving `df` unchanged). This is fairly standard behaviour, but may still be a useful point to note. – evophage Feb 04 '19 at 21:29

I've tried different solutions. The first was:

a_df = np.unique(df[['col1', 'col2']], axis=0)

which works well for non-object data. Another way to do this, which avoids the error for object-dtype columns, is to apply drop_duplicates():

a_df=df.drop_duplicates(['col1','col2'])[['col1','col2']]

You can also use SQL to do this, but in my case it was very slow:

from pandasql import sqldf
q="""SELECT DISTINCT col1, col2 FROM df;"""
pysqldf = lambda q: sqldf(q, globals())
a_df = pysqldf(q)
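As a sketch of the dtype issue mentioned above (hypothetical data): np.unique with axis=0 can raise a TypeError on object-dtype columns, while drop_duplicates handles any dtype:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': ['x', 'y', 'x'], 'col2': ['a', 'b', 'a']})

# np.unique(..., axis=0) is not supported for object dtype
try:
    np.unique(df[['col1', 'col2']], axis=0)
except TypeError as exc:
    print('np.unique failed:', exc)

# drop_duplicates works regardless of dtype
a_df = df.drop_duplicates(['col1', 'col2'])
print(a_df)
```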
Yury Wallet

To solve a similar problem, I'm using groupby:

print(f"Distinct entries: {len(df.groupby(['col1', 'col2']))}")

Whether that's appropriate will depend on what you want to do with the result, though (in my case, I just wanted the equivalent of COUNT DISTINCT as shown).
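Using the toy frame from the accepted answer, a quick sketch showing that the groupby count agrees with drop_duplicates:

```python
import pandas as pd

df = pd.DataFrame({'col1': [1, 2, 1, 2], 'col2': [3, 4, 3, 5]})

# COUNT(DISTINCT col1, col2): the number of groups equals
# the number of distinct (col1, col2) pairs
n_groupby = len(df.groupby(['col1', 'col2']))
n_drop = len(df.drop_duplicates(['col1', 'col2']))

print(n_groupby, n_drop)  # both 3
```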

ncoghlan

I think drop_duplicates will sometimes not be so useful, depending on the DataFrame.

I found this:

[in] df['col_1'].unique()
[out] array(['A', 'B', 'C'], dtype=object)

And it worked for me!

https://riptutorial.com/pandas/example/26077/select-distinct-rows-across-dataframe

Naiara Andrade

There is no unique method for a DataFrame. If the number of unique values for each column were the same, then the following would work: df.apply(pd.Series.unique). But if not, you will get an error. Another approach would be to store the values in a dict keyed on the column name:

In [111]:
df = pd.DataFrame({'a':[0,1,2,2,4], 'b':[1,1,1,2,2]})
d={}
for col in df:
    d[col] = df[col].unique()
d

Out[111]:
{'a': array([0, 1, 2, 4], dtype=int64), 'b': array([1, 2], dtype=int64)}
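The loop above can also be written as a dict comprehension:

```python
import pandas as pd

df = pd.DataFrame({'a': [0, 1, 2, 2, 4], 'b': [1, 1, 1, 2, 2]})

# One array of unique values per column, keyed by column name
d = {col: df[col].unique() for col in df}
print(d)
```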
EdChum

I think the following is the cleanest approach:

df.filter(items=['Column A', 'Column B']).drop_duplicates()
Bryun

You can take the sets of the columns and just subtract the smaller set from the larger set:

distinct_values = set(df['a'])-set(df['b'])
dorante

I stumbled onto this question looking for the same solution but using Apache Spark for .NET (C#).

The following worked for me (given a list of CSV files in a folder):

string filePath = "file:///Users/me/dups/*";

var opts = new Dictionary<string, string>();
opts.Add("header", "true");
opts.Add("quote", "\"");
opts.Add("multiLine", "true");
opts.Add("sep", ",");

// load data
DataFrame df1 = spark.Read()
   .Options(opts)
   .Csv(filePath);

// Columns to make distinct on. In my case there was one column I knew
// was distinct per row, so I listed all columns except that one.
Column[] columns = {
    Col("MyColumn1"),
    Col("MyColumn2"),
    // etc.
};

var distinctCount = df1
    .Select(columns)
    .Distinct()
    .Count();

Console.WriteLine(distinctCount);
Chris Smith