
I need to count how many distinct values there are in my DataFrame (df).

My df has the day of the month (as a number: 1, 2, 3 ... 31) on which a certain variable was measured. There are 3 columns that contain the day number. There are multiple measurements in one day, so my columns have repeated values. I need to know on how many days in a month the variable was measured, ignoring how many times the measurement was taken on each day. So I was thinking of counting the days while ignoring repeated values.

As an example, the data in my df would look like this:

col1  col2  col3
   2     2     2
   2     2     3
   3     3     3
   3     4     8

I need an output that tells me that the numbers in that DataFrame are 2, 3, 4 and 8.

Thanks!

5 Answers

3

Just do:

import pandas as pd

# build the example frame, flatten it into a Series with stack(), then take the unique values
df = pd.DataFrame({"col1": [2, 2, 3, 3], "col2": [2, 2, 3, 4], "col3": [2, 3, 3, 8]})

df.stack().unique()

Outputs:

[2 3 4 8]
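
This works because stack() collapses all the columns into a single Series, so unique() then sees every element of the frame at once. If you also want the count of days, a minimal follow-up sketch on the same example:

import pandas as pd

df = pd.DataFrame({"col1": [2, 2, 3, 3], "col2": [2, 2, 3, 4], "col3": [2, 3, 3, 8]})

days = df.stack().unique()    # unique values across every column
print(sorted(days.tolist()))  # [2, 3, 4, 8]
print(len(days))              # 4 -> number of distinct measurement days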
Grzegorz Skibinski
  • Do you know how this compares performance-wise to `pd.unique(df.to_numpy().flatten())` ? – AMC Mar 06 '20 at 01:58
  • Less operations- for starters. It's purely ```pandas``` solution (even though it returns ```numpy```), instead of jumping between ```pandas```, ```numpy```, and again ```pandas```. – Grzegorz Skibinski Mar 06 '20 at 09:33
  • I'm running some benchmarks, and so far `df.stack().unique()` actually appears to be a bit slower. WIP – AMC Mar 06 '20 at 23:08
  • As far as I can tell `pd.unique(df.to_numpy().flatten())` is actually consistently faster than `df.stack().unique()`. – AMC Mar 07 '20 at 01:04
1

You can use the drop_duplicates function on your DataFrame, like:

import pandas as pd

df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 3]})
print(df)

   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3

df = df.drop_duplicates()  # keep only the distinct rows
print(df['a'].count())
out: 2
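
Note that drop_duplicates removes duplicate rows, so this counts unique rows rather than unique values; the two can differ as soon as the columns disagree within a row. A small illustration with a hypothetical frame, just to show the distinction:

import pandas as pd

df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 4]})

print(len(df.drop_duplicates()))  # 2 -> unique rows (2,2,2) and (3,3,4)
print(df.stack().nunique())       # 3 -> distinct values 2, 3 and 4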
xavi
  • Shouldn’t it be `df.count()`, since we want the total number of unique values across the entire DataFrame? – AMC Mar 08 '20 at 00:54
1

Or you can use numpy to get the unique values in the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X': [2, 2, 3, 3], 'Y': [2, 2, 3, 4], 'Z': [2, 3, 3, 8]})

# np.unique flattens the frame's values and returns the sorted unique elements
df_unique = np.unique(np.array(df))

print(df_unique)
#Output [2 3 4 8]

#for the count of days:
print(len(df_unique))
#Output 4
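
One difference worth knowing between the NumPy and pandas routes: np.unique returns the values sorted, while pd.unique keeps them in order of first appearance. A tiny sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [3, 2], 'Y': [8, 4]})

print(np.unique(df.to_numpy()))            # [2 3 4 8] -- sorted
print(pd.unique(df.to_numpy().flatten()))  # [3 8 2 4] -- order of first appearance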
Kempie
0

How about:

Assuming this is your initial df:

   col1  col2  col3
0     2     2     2
1     2     2     2
2     3     3     3

Then:

count_df = pd.DataFrame()

# count each value's occurrences per column, then sum those counts across columns
for i in df.columns:
    df2 = df[i].value_counts()
    count_df = pd.concat([count_df, df2], axis=1)

final_df = count_df.sum(axis=1)
final_df = pd.DataFrame(data=final_df, columns=['Occurrences'])
print(final_df)

   Occurrences
2            6
3            3
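
As a comment below points out, concatenating inside the loop gets expensive on large frames; the per-column counts can be collected and concatenated once, or the loop avoided altogether with stack(). A sketch of both variants on the same toy df, which should produce the same Occurrences column:

import pandas as pd

df = pd.DataFrame({'col1': [2, 2, 3], 'col2': [2, 2, 3], 'col3': [2, 2, 3]})

# concatenate once instead of once per column
occ = pd.concat([df[c].value_counts() for c in df.columns], axis=1).sum(axis=1)

# or skip the loop entirely: flatten to a Series and count occurrences
occ_alt = df.stack().value_counts()

print(occ.sort_index())      # 2 -> 6, 3 -> 3
print(occ_alt.sort_index())  # same result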
teoeme139
  • Edit: Converted series to df to make it look tidier – teoeme139 Mar 05 '20 at 21:24
  • As your question has changed, this answer no longer applies. – teoeme139 Mar 05 '20 at 21:58
  • Repeatedly concatenating like that can destroy performance, it's probably best to add all the parts to a temporary data structure and concat once. Although the ideal would be to avoid explicitly looping entirely, it's completely unnecessary. – AMC Mar 06 '20 at 01:59
  • Even though the question was changed after my answer, I tested 3 different approaches and this was the fastest. I tested with a 10,000,000 x 3 dataframe and got a faster time with this method than using stack() or to_numpy().flatten() and counting occurrences. I disagree with your downvote here – teoeme139 Mar 06 '20 at 15:07
  • That’s certainly unexpected, I’ll take a look at the performance in a bit. – AMC Mar 06 '20 at 17:06
  • Please share results. It took 0.8 seconds to run the above code, whereas it took 1.6ish seconds to run the other two alternatives. – teoeme139 Mar 06 '20 at 17:15
  • Did you make any tweaks to the various solutions in order to benchmark them? Yours returns both the unique values and their frequency, not just the total number of unique values. – AMC Mar 06 '20 at 19:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/209179/discussion-between-teoeme139-and-amc). – teoeme139 Mar 06 '20 at 19:21
0

You can use pandas.unique() like so:

pd.unique(df.to_numpy().flatten())

I have done some basic benchmarking, and this method appears to be the fastest.
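
For anyone who wants to reproduce the comparison, here is a minimal timeit harness along the lines of what the comments below discuss (the frame shape, sizes and repeat count are illustrative assumptions, not the exact setup anyone used):

import timeit

import numpy as np
import pandas as pd

# illustrative data: 1,000,000 rows x 3 columns of day-of-month numbers
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 32, size=(1_000_000, 3)), columns=["col1", "col2", "col3"])

candidates = {
    "pd.unique + flatten": lambda: pd.unique(df.to_numpy().flatten()),
    "stack + unique": lambda: df.stack().unique(),
    "np.unique": lambda: np.unique(df.to_numpy()),
}

for label, fn in candidates.items():
    elapsed = timeit.timeit(fn, number=10)
    print(f"{label}: {elapsed / 10:.4f}s per run")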

AMC
  • adding the benchmarking to the post would be useful – baxx Mar 08 '20 at 00:09
  • @baxx Yes, I intend on doing so tomorrow! – AMC Mar 08 '20 at 00:52
  • I'd be interested in the use of `to_numpy()` over `values` here as well; perhaps it's just whatever was first to mind. I couldn't think of a reason for one over the other though (perhaps that will be clear with the benchmarks?) – baxx Mar 08 '20 at 00:57
  • @baxx It isn’t for performance reasons, although I am now curious to know if there is one, you can find a good explanation of `.values` and `.to_numpy()` [here](https://stackoverflow.com/a/54508052/11301900). – AMC Mar 08 '20 at 01:08
  • Ah, I hadn't seen that information, thanks for the link. I'll stop using `values`; I think that's a good reason to use `to_numpy` instead. – baxx Mar 08 '20 at 01:45
  • @baxx Until I do manage to write a proper answer, you can find some of my benchmarks posted [here](https://chat.stackoverflow.com/rooms/209179/discussion-between-teoeme139-and-amc). – AMC Mar 09 '20 at 02:23