
I need some tips to make a calculation.

I have a DataFrame that looks like the following:

text_id      user     date        important_words
1            John    2018-01-01   {cat, dog, puppy}           
1            John    2018-02-01   {cat, dog}
2            Anne    2018-01-01   {flower, sun}
3            John    2018-03-01   {water, blue}
3            Marie   2018-05-01   {water, blue, ocean}
3            Kate    2018-08-01   {island, sand, towel}
4            Max     2018-01-01   {hot, cold}
4            Ethan   2018-06-01   {hot, warm}
5            Marie   2019-01-01   {boo}

In the given dataframe:

the text_id refers to the id of a text: each text with a different id is a different text. The user column is the name of the user who edited the text (adding and erasing important words). The date column is the moment the edit was made (note that the edits on each text are chronologically sorted). Finally, the important_words column is the set of important words present in the text after the user's edit.
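
For reproducibility, a minimal construction of this sample frame could look like the following (a sketch, assuming the important_words column holds Python set objects):

import pandas as pd

df = pd.DataFrame({
    'text_id': [1, 1, 2, 3, 3, 3, 4, 4, 5],
    'user': ['John', 'John', 'Anne', 'John', 'Marie', 'Kate', 'Max', 'Ethan', 'Marie'],
    'date': pd.to_datetime(['2018-01-01', '2018-02-01', '2018-01-01',
                            '2018-03-01', '2018-05-01', '2018-08-01',
                            '2018-01-01', '2018-06-01', '2019-01-01']),
    'important_words': [{'cat', 'dog', 'puppy'}, {'cat', 'dog'}, {'flower', 'sun'},
                        {'water', 'blue'}, {'water', 'blue', 'ocean'},
                        {'island', 'sand', 'towel'}, {'hot', 'cold'},
                        {'hot', 'warm'}, {'boo'}],
})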

I need to calculate how many words were added by each user in each edit of a text.

The expected output here would be:

text_id      user     date        important_words        added_words
1            John    2018-01-01   {cat, dog, puppy}      3        
1            John    2018-02-01   {cat, dog}             0
2            Anne    2018-01-01   {flower, sun}          2
3            John    2018-03-01   {water, blue}          2
3            Marie   2018-05-01   {water, blue, ocean}   1
3            Kate    2018-08-01   {island, sand, towel}  3
4            Max     2018-01-01   {hot, cold}            2
4            Ethan   2018-06-01   {hot, warm}            1
5            Marie   2019-01-01   {boo}                  1

Note that the first edit of a text is its creation, so in that case the number of added words is always the size of the important_words set.

Any tips on the fastest way to compute the added_words column would be highly appreciated.

Note that the important_words column contains a set, so calculating the difference between two consecutive edits should be easy.
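
For reference, a minimal sketch of that set-difference idea (just the straightforward approach, not necessarily the fastest; it assumes important_words holds Python sets and that rows are date-sorted within each text_id):

# previous set within each text; NaN for the first edit of a text
prev = df.groupby('text_id')['important_words'].shift()
# first edits have no predecessor, so the whole set counts as added
df['added_words'] = [
    len(cur - prev_set) if isinstance(prev_set, set) else len(cur)
    for cur, prev_set in zip(df['important_words'], prev)
]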

HRDSL
1 Answer


Hard to think about but interesting :-) I am using get_dummies; then we keep only the first 1 value per column and sum them across each row.

# one indicator column per word: 1 where the word is present in that row's set
s=df.important_words.map(','.join).str.get_dummies(sep=',')
# mask the 0s, then cumsum marks each word's first appearance with 1; sum per row
s.mask(s==0).cumsum().eq(1).sum(1)
Out[247]: 
0    3
1    0
2    2
3    2
4    1
5    3
6    2
7    1
8    1
dtype: int64
df['val']=s.mask(s==0).cumsum().eq(1).sum(1)

Update

To count first appearances within each text_id rather than globally, group by text_id before the cumsum:

s=df.important_words.map(','.join).str.get_dummies(sep=',')
# cumsum within each text_id, so a word repeated in a different text still counts
s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
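
To attach the result as the added_words column, and to sidestep the SettingWithCopyWarning mentioned in the comments below when df is a slice of another frame, one option (a sketch, not part of the original answer) is:

added = s.mask(s==0).groupby(df['text_id']).cumsum().eq(1).sum(1)
df = df.copy()  # work on an explicit copy so the assignment is unambiguous
df['added_words'] = added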
BENY
  • @Datanovice try the data input :-) – BENY Aug 13 '19 at 14:50
  • it was my fault, the sep needed a space with the comma! sorry, amazing code as always. – Umar.H Aug 13 '19 at 14:51
  • Wow, it's really good. Question is: it must count the added words on each text. Where in the code is that implemented? I'm also getting this error when trying it in a Jupyter notebook: "A value is trying to be set on a copy of a slice from a DataFrame. Try using .loc[row_indexer,col_indexer] = value instead" – HRDSL Aug 13 '19 at 15:14
  • @HRDSL https://stackoverflow.com/questions/20625582/how-to-deal-with-settingwithcopywarning-in-pandas – BENY Aug 13 '19 at 15:41
  • @HRDSL that is get_dummies: if the word is added it will show 1, if not it will show 0 – BENY Aug 13 '19 at 15:49
  • Yes, but the thing is: I've tried with another example and it counts the number of words added between two texts with different ids. I want the operation to count only the words added to the SAME text (that is, the same id). – HRDSL Aug 14 '19 at 12:54