199

With a PySpark DataFrame, how do you do the equivalent of Pandas df['col'].unique()?

I want to list out all the unique values in a pyspark dataframe column.

Not the SQL-type way (registerTempTable and then a SQL query for the distinct values).

Also, I don't need groupBy then countDistinct; instead I want to check the distinct VALUES in that column.

blackbishop
  • 30,945
  • 11
  • 55
  • 76
Satya
  • 5,470
  • 17
  • 47
  • 72

14 Answers

381

This should help to get distinct values of a column:

df.select('column1').distinct().collect()

Note that .collect() doesn't have any built-in limit on how many values can be returned, so this might be slow. Use .show() instead, or add .limit(20) before .collect(), to manage this.
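
A minimal sketch of those two options, assuming a DataFrame df with a column named column1 (both names are placeholders):

# print the first 20 distinct values without pulling everything to the driver
df.select('column1').distinct().show()

# or cap how many rows are brought back before collecting
df.select('column1').distinct().limit(20).collect()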

MichaelChirico
  • 33,841
  • 14
  • 113
  • 198
Pabbati
  • 3,809
  • 2
  • 7
  • 3
  • this code returns data that's not iterable, i.e. I see the distinct data but am not able to iterate over it in code. Any other way that enables me to do it? I tried using toPandas() to convert it into a Pandas df and then get the iterable with the unique values. However, I'm running into a 'Pandas not found' error message – Abhi Dec 17 '18 at 01:08
  • 12
    @Abhi: in place of .show(), do a .collect(); that way you will get an iterable of all the distinct values of that particular column. But make sure your master node has enough memory to hold those unique values, because collect will push all the requested data (in this case the unique values of the column) to the master node :) (a sketch follows these comments) – Satya Jan 16 '19 at 04:01
  • 3
    To obtain a Python **list of actual values** and not `Row` objects, you need to use a list comprehension like in this answer: https://stackoverflow.com/a/60896261/7465462 – Ric S Feb 23 '22 at 15:29
  • df.select('column').distinct().toPandas().column.to_list() – travelingbones Oct 04 '22 at 02:28
  • 1
    `pyspark_df.select(target_column_name).distinct().toPandas()[target_column_name].to_list()` will give a flat python list. – Bikash Gyawali Oct 19 '22 at 11:40
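
Pulling the comments above together, a minimal sketch of turning the collected Rows into a plain Python list (column1 is a placeholder; the pandas route requires pandas to be installed on the driver):

rows = df.select('column1').distinct().collect()   # list of Row objects on the driver
values = [row['column1'] for row in rows]          # flatten to plain Python values

# or go through pandas, as the previous comments suggest
values = df.select('column1').distinct().toPandas()['column1'].to_list()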
122

Let's assume we're working with the following representation of data (two columns, k and v, where k contains three entries, two of them unique):

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

With a Pandas dataframe:

import pandas as pd
p_df = pd.DataFrame([("foo", 1), ("bar", 2), ("foo", 3)], columns=("k", "v"))
p_df['k'].unique()

This returns an ndarray, i.e. array(['foo', 'bar'], dtype=object)

You asked for a "pyspark dataframe alternative for pandas df['col'].unique()". Now, given the following Spark dataframe:

s_df = sqlContext.createDataFrame([("foo", 1), ("bar", 2), ("foo", 3)], ('k', 'v'))

If you want the same result from Spark, i.e. an ndarray, use toPandas():

s_df.toPandas()['k'].unique()

Alternatively, if you don't need an ndarray specifically and just want a list of the unique values of column k:

s_df.select('k').distinct().rdd.map(lambda r: r[0]).collect()

Finally, you can also use a list comprehension as follows:

[i.k for i in s_df.select('k').distinct().collect()]
Daniel Moraite
  • 422
  • 5
  • 8
eddies
  • 7,113
  • 3
  • 36
  • 39
  • 2
    Hi eddies, the last code line distinct().map() didn't work for me. Error: AttributeError: 'DataFrame' object has no attribute 'map'. I am on Spark 2.0. And as for the toPandas thing, I would not call it an alternative: it converts the Spark dataframe to a pandas dataframe first and then does the pandas operation on it. – Satya Sep 08 '16 at 08:57
  • 2
    Hi satya. Just updated the answer by adding a `.rdd` call after `distinct()`. It worked without that in Spark 1.6.2, but I just confirmed that the edited answer works in Spark 2.0.0 as well. – eddies Sep 08 '16 at 11:01
  • 7
    Why try to avoid Spark dataframe operations by converting to a pandas dataframe (which hurts if it's gigantic) or utilizing rdd operations when Spark dataframes are perfectly capable of doing this? See the answer below by @Pabbati – Laurens Koppenol Jan 05 '18 at 13:28
  • @Laurens There were three solutions in the answer above, depending on what the poster really wanted. In all cases, the poster wanted some form of a list/array of the distinct values (c.f. poster's response to seufagner's answer). The third solution above does use Spark's dataframe api just as Pabbati's answer but actually returns a list, as per the poster's requirements. – eddies Jan 09 '18 at 01:20
  • The question specifically asks how to "show" unique values of a dataframe column. I think @Pabbati 's answer does this effectively: df.select('column1').distinct().show() – RNHTTR Aug 28 '18 at 20:52
  • 2
    Yes, the question title includes the word "show". But the poster specifically clarified that SEEing the results wasn't adequate and wanted a list. As mentioned above, see the poster's comment to seufagner's answer. – eddies Aug 29 '18 at 03:43
  • Good point. I suggested an edit to the question to ask for a list instead of showing the results. – RNHTTR Aug 29 '18 at 18:27
  • `[i.k for i in s_df.select('k').distinct().collect()]` ; I have a doubt. I only have the column name as a string (for example 'k'). Is there any way to write `i.k` when I only have the column name as a string? Actually the column name is taken as user input, that's why I'm asking. – ahrooran Jun 01 '20 at 13:08
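
Regarding the question in the last comment: Row objects also support dictionary-style indexing, so a sketch for when the column name is only available as a string (col_name is a hypothetical variable) could look like:

col_name = 'k'   # hypothetical: e.g. supplied by the user
values = [row[col_name] for row in s_df.select(col_name).distinct().collect()]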
24

You can use df.dropDuplicates(['col1','col2']) to get only the distinct rows based on the columns you list.
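
A minimal sketch of that idea (df, col1 and col2 are placeholders):

# keep one row per distinct value of col1
df.dropDuplicates(['col1']).show()

# keep one row per distinct (col1, col2) combination
distinct_pairs = df.dropDuplicates(['col1', 'col2']).collect()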

seufagner
  • 1,290
  • 2
  • 18
  • 25
  • 3
    @seufagner - yes, I can do a df.dropDuplicates(['col1']) to see (mark, SEE) the unique values, but without a collect (to rdd or to a pandas DF and then df['col'].unique()), I can't get the list of unique values. Thanks for the suggestion. – Satya May 30 '17 at 10:22
  • The user did not ask how to display non-duplicate values. He just wanted to get a list of all unique/distinct items from a column that includes duplicates too! – Utsav Jha Mar 06 '20 at 13:50
  • This is correct, because df.select().collect() is an expensive operation which may lead to a stage failure error. – Palash Mondal May 12 '22 at 07:15
20

If you want to see the distinct values of a specific column in your dataframe, you just need to write the following code. It will show up to 100 distinct values (if at least 100 distinct values are available) of the colname column in the df dataframe.

df.select('colname').distinct().show(100, False)

If you want to do something fancy on the distinct values, you can save them in a DataFrame:

a = df.select('colname').distinct()
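
For example, a short sketch of what you can then do with a (it is still a regular DataFrame, so nothing is computed until an action is called):

a.count()                                         # how many distinct values there are
values = [row['colname'] for row in a.collect()]  # materialize them as a Python list if they fit in driver memory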
Jaroslav Bezděk
  • 6,967
  • 6
  • 29
  • 46
Nidhi
  • 561
  • 4
  • 7
13

collect_set can help to get the unique values from a given column of a pyspark.sql.DataFrame:

from pyspark.sql import functions as F
df.select(F.collect_set("column").alias("column")).first()["column"]
Mykola Zotko
  • 15,583
  • 3
  • 71
  • 73
Hari Baskar
  • 416
  • 7
  • 12
  • 2
    I agree on using `collect_set`, but I think the following would be cleaner: ```df.agg(F.collect_set("column")).collect()[0][0]``` – stinodego May 28 '22 at 19:36
12

you could do

distinct_column = 'somecol' 

distinct_column_vals = df.select(distinct_column).distinct().collect()
distinct_column_vals = [v[distinct_column] for v in distinct_column_vals]
muon
  • 12,821
  • 11
  • 69
  • 88
  • This is amazingly simple, thank you – GenDemo Nov 16 '22 at 00:01
  • Just a quick comment: since you already selected `distinct_column` in the first line, you don't need to repeat it below. You can just do `[v[0] for v in distinct_column_vals]` – Ric S Jul 28 '23 at 08:29
4

Similar to the other answers, but the question doesn't seem to want Row objects returned, but instead actual values.

The ideal one-liner is df.select('column').distinct().toPandas().column.to_list(), assuming that the set of distinct values being pulled into memory isn't going to be too big.

I recommend a df.select('column').distinct().count() first to estimate size, and make sure it's not too huge beforehand.
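
A sketch of that guard (the threshold of 100000 is arbitrary and only for illustration):

n = df.select('column').distinct().count()
if n < 100000:
    values = df.select('column').distinct().toPandas()['column'].to_list()
else:
    print(f"{n} distinct values -- probably too many to collect to the driver")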

travelingbones
  • 7,919
  • 6
  • 36
  • 43
3

In addition to the dropDuplicates option, there is also the method named as we know it from pandas: drop_duplicates:

drop_duplicates() is an alias for dropDuplicates().

Example

s_df = sqlContext.createDataFrame([("foo", 1),
                                   ("foo", 1),
                                   ("bar", 2),
                                   ("foo", 3)], ('k', 'v'))
s_df.show()

+---+---+
|  k|  v|
+---+---+
|foo|  1|
|foo|  1|
|bar|  2|
|foo|  3|
+---+---+

Drop by subset

s_df.drop_duplicates(subset = ['k']).show()

+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  1|
+---+---+
s_df.drop_duplicates().show()


+---+---+
|  k|  v|
+---+---+
|bar|  2|
|foo|  3|
|foo|  1|
+---+---+
ansev
  • 30,322
  • 5
  • 17
  • 31
3

If you want to select ALL columns' data as distinct from a DataFrame (df), then

df.select('*').distinct().show(10,truncate=False)

Kapil Sharma
  • 41
  • 1
  • 5
2

Run this first

df.createOrReplaceTempView('df')

Then run

spark.sql("""
    SELECT DISTINCT
        column_name
    FROM
        df
    """).show()
Joseph Jacob
  • 163
  • 1
  • 7
2

I find this to be the most intuitive (i.e., similar to Pandas) syntax:

new_df = df[['col']].distinct() # returns a one-column pyspark dataframe

or

new_col = df[['col']].distinct()['col'] # returns a pyspark column object. 

So to get those to a list you need:

new_list = df[['col']].distinct().rdd.map(lambda row: row.col).collect()
travelingbones
  • 7,919
  • 6
  • 36
  • 43
1

This can be helpful

from pyspark.sql.functions import collect_set
df.select(collect_set("your column")).show()
0

Let us suppose that your original DataFrame is called df. Then, you can use:

from pyspark.sql import functions as F

df1 = df.groupBy('column_1').agg(F.count('column_1').alias('trip_count'))
df1.sort(df1.trip_count.desc()).show()
Marioanzas
  • 1,663
  • 2
  • 10
  • 33
0

I did this:

unique_values = df.select("col").distinct().rdd.flatMap(lambda x: x).collect()
print(unique_values)
Aymen Azoui
  • 369
  • 2
  • 4