2

I have a DataFrame containing certain words with a unique identifier time that is ascending.

For example:

words = ["a", "b", "a", "c", "d", "a", "b"]
df = pandas.DataFrame(words, columns=["word"])
df.index.name = "time"
     word
time     
0       a
1       b
2       a
3       c
4       d
5       a
6       b

I would like to count the number of unique values in the word column over time, to get a table like this:

     word  rolling_unique_word_count
time                                
0       a                          1
1       b                          2
2       a                          2
3       c                          3
4       d                          4
5       a                          4
6       b                          4

Are there any build-in functions to perform this task?

Peter
  • 96
  • 9

3 Answers3

3

Here Series.expanding working with numeric only, so column was factorized first and then is used custom function for count number of unique values:

s = pd.Series(pd.factorize(df['word'])[0], index=df.index)
df['new'] = s.expanding().apply(lambda x: len(np.unique(x)))
#alternative
#df['new'] = s.expanding().apply(lambda x: len(set(x)))
print (df)
     word  new
time          
0       a  1.0
1       b  2.0
2       a  2.0
3       c  3.0
4       d  4.0
5       a  4.0
6       b  4.0
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • Will it not be more efficient to use the `set.add` feature instead of taking the set at every iteration? - something like this `newcol = []; tmpset = set(); for v in df['word']: tmpset.add(v); newcol.append(len(tmpset)); df['new'] = newcol` – oskros Nov 30 '20 at 11:37
2

One liner using .duplicated

df['new'] = (~df['word'].duplicated()).cumsum()

Output

        word  new
time          
0       a    1
1       b    2
2       a    2
3       c    3
4       d    4
5       a    4
6       b    4
Leonardus Chen
  • 1,103
  • 6
  • 20
1
def sum_unique(x):
    already_in = []
    list_acc = []
    counter = 0
    for i in x:
        if i in already_in:
            list_acc.append(counter)
        else:
            counter+=1
            list_acc.append(counter)
        already_in.append(i)

    return list_acc

df["unique"] = sum_unique(df.word)

#output

    word unique
t       
0    a     1
1    b     2
2    a     2
3    c     3
4    d     4
5    a     4
6    b     4

I know it isn't the most pretty or optimal way. But it does the job.

It might not be fast at all, but it works as needed.

ombk
  • 2,036
  • 1
  • 4
  • 16