Rolling unique value count in pandas

Question

I have a DataFrame containing certain words with a unique identifier time that is ascending.

For example:

words = ["a", "b", "a", "c", "d", "a", "b"]
df = pandas.DataFrame(words, columns=["word"])
df.index.name = "time"

I would like to count the number of unique values in the word column over time, to get a table like this:

     word  rolling_unique_word_count
time                                
0       a                          1
1       b                          2
2       a                          2
3       c                          3
4       d                          4
5       a                          4
6       b                          4

Are there any build-in functions to perform this task?

score 3 · Answer 1 · answered Nov 30 '20 at 11:33

3

Here Series.expanding working with numeric only, so column was factorized first and then is used custom function for count number of unique values:

s = pd.Series(pd.factorize(df['word'])[0], index=df.index)
df['new'] = s.expanding().apply(lambda x: len(np.unique(x)))
#alternative
#df['new'] = s.expanding().apply(lambda x: len(set(x)))
print (df)
     word  new
time          
0       a  1.0
1       b  2.0
2       a  2.0
3       c  3.0
4       d  4.0
5       a  4.0
6       b  4.0

answered Nov 30 '20 at 11:33

jezrael

822,522
95
1,334
1,252

Will it not be more efficient to use the `set.add` feature instead of taking the set at every iteration? - something like this `newcol = []; tmpset = set(); for v in df['word']: tmpset.add(v); newcol.append(len(tmpset)); df['new'] = newcol` – oskros Nov 30 '20 at 11:37

Leonardus Chen · Accepted Answer · 2020-11-30T20:07:44.033

2

One liner using .duplicated

df['new'] = (~df['word'].duplicated()).cumsum()

Output

        word  new
time          
0       a    1
1       b    2
2       a    2
3       c    3
4       d    4
5       a    4
6       b    4

edited Nov 30 '20 at 20:07

answered Nov 30 '20 at 19:54

Leonardus Chen

1,103
6
20

Very fast and elegant! – Peter Dec 01 '20 at 20:33

score 1 · Answer 3 · answered Nov 30 '20 at 11:40

def sum_unique(x):
    already_in = []
    list_acc = []
    counter = 0
    for i in x:
        if i in already_in:
            list_acc.append(counter)
        else:
            counter+=1
            list_acc.append(counter)
        already_in.append(i)

    return list_acc

df["unique"] = sum_unique(df.word)

#output

    word unique
t       
0    a     1
1    b     2
2    a     2
3    c     3
4    d     4
5    a     4
6    b     4

I know it isn't the most pretty or optimal way. But it does the job.

It might not be fast at all, but it works as needed.

Rolling unique value count in pandas

3 Answers3

Linked