
I need to count how many distinct values there are in my DataFrame (df).

My df has the day of the month (as a number: 1, 2, 3 ... 31) on which a certain variable was measured. There are 3 columns that contain the day number. There are multiple measurements in one day, so my columns have repeated values. I need to know on how many days in a month the variable was measured, ignoring how many times the measurement was taken on each day. So I was thinking of counting the days while ignoring repeated values.

As an example, the data in my df would look like this:

col1  col2  col3
   2     2     2
   2     2     3
   3     3     3
   3     4     8

I need an output that tells me that the numbers in that DataFrame are 2, 3, 4 and 8.

Thanks!

5 Answers

3

Just do:

import pandas as pd

# build the example frame, flatten it into a Series with stack(), then take the unique values
df = pd.DataFrame({"col1": [2, 2, 3, 3], "col2": [2, 2, 3, 4], "col3": [2, 3, 3, 8]})

df.stack().unique()

Outputs:

[2 3 4 8]
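
This works because stack() collapses all the columns into a single Series, so unique() then sees every element of the frame at once. If you also want the count of days, a minimal follow-up sketch on the same example:

import pandas as pd

df = pd.DataFrame({"col1": [2, 2, 3, 3], "col2": [2, 2, 3, 4], "col3": [2, 3, 3, 8]})

days = df.stack().unique()    # unique values across every column
print(sorted(days.tolist()))  # [2, 3, 4, 8]
print(len(days))              # 4 -> number of distinct measurement days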
Grzegorz Skibinski
  • Do you know how this compares performance-wise to `pd.unique(df.to_numpy().flatten())` ? – AMC Mar 06 '20 at 01:58
  • Less operations- for starters. It's purely ```pandas``` solution (even though it returns ```numpy```), instead of jumping between ```pandas```, ```numpy```, and again ```pandas```. – Grzegorz Skibinski Mar 06 '20 at 09:33
  • I'm running some benchmarks, and so far `df.stack().unique()` actually appears to be a bit slower. WIP – AMC Mar 06 '20 at 23:08
  • As far as I can tell `pd.unique(df.to_numpy().flatten())` is actually consistently faster than `df.stack().unique()`. – AMC Mar 07 '20 at 01:04
1

You can use the drop_duplicates function on your DataFrame, like:

import pandas as pd

df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 3]})
print(df)

   a  b  c
0  2  2  2
1  2  2  2
2  3  3  3

df = df.drop_duplicates()  # keep only the distinct rows
print(df['a'].count())
out: 2
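
Note that drop_duplicates removes duplicate rows, so this counts unique rows rather than unique values; the two can differ as soon as the columns disagree within a row. A small illustration with a hypothetical frame, just to show the distinction:

import pandas as pd

df = pd.DataFrame({'a': [2, 2, 3], 'b': [2, 2, 3], 'c': [2, 2, 4]})

print(len(df.drop_duplicates()))  # 2 -> unique rows (2,2,2) and (3,3,4)
print(df.stack().nunique())       # 3 -> distinct values 2, 3 and 4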
xavi
  • Shouldn’t it be `df.count()`, since we want the total number of unique values across the entire DataFrame? – AMC Mar 08 '20 at 00:54
1

Or you can use numpy to get the unique values in the dataframe:

import pandas as pd
import numpy as np

df = pd.DataFrame({'X': [2, 2, 3, 3], 'Y': [2, 2, 3, 4], 'Z': [2, 3, 3, 8]})

# np.unique flattens the frame's values and returns the sorted unique elements
df_unique = np.unique(np.array(df))

print(df_unique)
#Output [2 3 4 8]

#for the count of days:
print(len(df_unique))
#Output 4
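
One difference worth knowing between the NumPy and pandas routes: np.unique returns the values sorted, while pd.unique keeps them in order of first appearance. A tiny sketch:

import numpy as np
import pandas as pd

df = pd.DataFrame({'X': [3, 2], 'Y': [8, 4]})

print(np.unique(df.to_numpy()))            # [2 3 4 8] -- sorted
print(pd.unique(df.to_numpy().flatten()))  # [3 8 2 4] -- order of first appearance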
Kempie
0

How about:

Assuming this is your initial df:

   col1  col2  col3
0     2     2     2
1     2     2     2
2     3     3     3

Then:

count_df = pd.DataFrame()

# count each value's occurrences per column, then sum those counts across columns
for i in df.columns:
    df2 = df[i].value_counts()
    count_df = pd.concat([count_df, df2], axis=1)

final_df = count_df.sum(axis=1)
final_df = pd.DataFrame(data=final_df, columns=['Occurrences'])
print(final_df)

   Occurrences
2            6
3            3
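
As a comment below points out, concatenating inside the loop gets expensive on large frames; the per-column counts can be collected and concatenated once, or the loop avoided altogether with stack(). A sketch of both variants on the same toy df, which should produce the same Occurrences column:

import pandas as pd

df = pd.DataFrame({'col1': [2, 2, 3], 'col2': [2, 2, 3], 'col3': [2, 2, 3]})

# concatenate once instead of once per column
occ = pd.concat([df[c].value_counts() for c in df.columns], axis=1).sum(axis=1)

# or skip the loop entirely: flatten to a Series and count occurrences
occ_alt = df.stack().value_counts()

print(occ.sort_index())      # 2 -> 6, 3 -> 3
print(occ_alt.sort_index())  # same result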
teoeme139
  • Edit: Converted series to df to make it look tidier – teoeme139 Mar 05 '20 at 21:24
  • As your question has changed, this answer no longer applies. – teoeme139 Mar 05 '20 at 21:58
  • Repeatedly concatenating like that can destroy performance, it's probably best to add all the parts to a temporary data structure and concat once. Although the ideal would be to avoid explicitly looping entirely, it's completely unnecessary. – AMC Mar 06 '20 at 01:59
  • Even though the question was changed after my answer, I tested 3 different approaches and this was the fastest. I tested with a 10,000,000 x 3 dataframe and got a faster time with this method than using stack() or to_numpy().flatten() and counting occurrences. I disagree with your downvote here – teoeme139 Mar 06 '20 at 15:07
  • That’s certainly unexpected, I’ll take a look at the performance in a bit. – AMC Mar 06 '20 at 17:06
  • Please share results. It took 0.8 seconds to run the above code, whereas it took 1.6ish seconds to run the other two alternatives. – teoeme139 Mar 06 '20 at 17:15
  • Did you make any tweaks to the various solutions in order to benchmark them? Yours returns both the unique values and their frequency, not just the total number of unique values. – AMC Mar 06 '20 at 19:07
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/209179/discussion-between-teoeme139-and-amc). – teoeme139 Mar 06 '20 at 19:21
0

You can use pandas.unique() like so:

pd.unique(df.to_numpy().flatten())

I have done some basic benchmarking, and this method appears to be the fastest.
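
For anyone who wants to reproduce the comparison, here is a minimal timeit harness along the lines of what the comments below discuss (the frame shape, sizes and repeat count are illustrative assumptions, not the exact setup anyone used):

import timeit

import numpy as np
import pandas as pd

# illustrative data: 1,000,000 rows x 3 columns of day-of-month numbers
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.integers(1, 32, size=(1_000_000, 3)), columns=["col1", "col2", "col3"])

candidates = {
    "pd.unique + flatten": lambda: pd.unique(df.to_numpy().flatten()),
    "stack + unique": lambda: df.stack().unique(),
    "np.unique": lambda: np.unique(df.to_numpy()),
}

for label, fn in candidates.items():
    elapsed = timeit.timeit(fn, number=10)
    print(f"{label}: {elapsed / 10:.4f}s per run")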

AMC
  • adding the benchmarking to the post would be useful – baxx Mar 08 '20 at 00:09
  • @baxx Yes, I intend on doing so tomorrow! – AMC Mar 08 '20 at 00:52
  • I'd be interested in the use of `to_numpy()` over `values` here as well; perhaps it's just whatever was first to mind. I couldn't think of a reason for one over the other though (perhaps that will be clear with the benchmarks?) – baxx Mar 08 '20 at 00:57
  • @baxx It isn’t for performance reasons, although I am now curious to know if there is one, you can find a good explanation of `.values` and `.to_numpy()` [here](https://stackoverflow.com/a/54508052/11301900). – AMC Mar 08 '20 at 01:08
  • Ah, I hadn't seen that information, thanks for the link. I'll stop using `values`; I think that's a good reason to use `to_numpy` instead. – baxx Mar 08 '20 at 01:45
  • @baxx Until I do manage to write a proper answer, you can find some of my benchmarks posted [here](https://chat.stackoverflow.com/rooms/209179/discussion-between-teoeme139-and-amc). – AMC Mar 09 '20 at 02:23