I have a data set which contains something like this:

SNo  Cookie
1       A
2       A
3       A
4       B
5       C
6       D
7       A
8       B
9       D
10      E
11      D
12      A

So let's say we have 5 cookies: A, B, C, D, E. I want to count how many times each cookie reoccurs after a different cookie has been encountered. For example, in the data above, cookie A is encountered again at position 7 and then again at position 12. NOTE: We wouldn't count A at position 2, because it immediately follows another A, but at positions 7 and 12 other cookies had been seen before A appeared again, so those instances count. So essentially I want something like this:

Sno Cookie  Count
 1     A     2
 2     B     1
 3     C     0
 4     D     2
 5     E     0

Can anyone give me the logic or Python code for this?
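
For anyone who wants to run the answers below, here is a minimal sketch of how the sample data could be set up. It assumes pandas is installed; the variable name `df` is an assumption chosen to match the name the answers use.

```python
import pandas as pd

# Sample data from the question. The name `df` is an assumption that
# matches the variable name used in the answers.
df = pd.DataFrame({
    "SNo": range(1, 13),
    "Cookie": list("AAABCDABDEDA"),
})

print(df)
```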

DYZ
Kshitij Yadav
  • What your data set looks like in a representation is not as interesting as a piece of code which sets up a suitable data structure to contain it. This is not the same as making a [mcve], but it is similar and serves similar purposes. – Yunnosch Aug 28 '18 at 20:33

3 Answers

One way to do this is to first drop consecutive repeats of the same Cookie, then use duplicated to flag each remaining row whose Cookie has been seen before, and finally group by Cookie and sum the flags:

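# Keep only rows whose Cookie differs from the previous row's;
# .copy() avoids a SettingWithCopyWarning on the assignment below
no_doubles = df[df.Cookie != df.Cookie.shift()].copy()

no_doubles['dups'] = no_doubles.Cookie.duplicated()

no_doubles.groupby('Cookie').dups.sum()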

This gives you:

Cookie
A    2.0
B    1.0
C    0.0
D    2.0
E    0.0
Name: dups, dtype: float64
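
As a rough, self-contained sketch of the same three steps, the function below works on a Series directly and casts the result to integers. The function name `count_reoccurrences` and the output name `Count` are illustrative choices, not part of the answer above.

```python
import pandas as pd

def count_reoccurrences(cookies: pd.Series) -> pd.Series:
    """Count, per value, how many runs start after that value's first run."""
    # Keep only the first row of each run of identical consecutive values.
    run_starts = cookies[cookies != cookies.shift()]
    # duplicated() is True for every run start except a value's first one.
    is_repeat = run_starts.duplicated()
    # Sum the flags per value; the cast keeps the result integer-valued.
    return is_repeat.groupby(run_starts).sum().astype(int).rename("Count")

df = pd.DataFrame({"SNo": range(1, 13), "Cookie": list("AAABCDABDEDA")})
result = count_reoccurrences(df["Cookie"])
print(result)
```

Because runs are collapsed before anything is counted, the length of a run (2 or 5 or more consecutive rows) does not affect the result.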
sacuL
  • Hey, thanks. But I think your answer only handles a cookie appearing 2 times in a row. What if it appears more than 2 times in a row, let's say 5 times? What would the logic be then? – Kshitij Yadav Aug 28 '18 at 20:34
  • That will still work, because the code to create `no_doubles` will get rid of consecutive Cookies, regardless of whether there are 2 or 200000 consecutively – sacuL Aug 28 '18 at 20:36
  • Man! You just saved me my job. That worked so smoothly! Thanks a ton buddy :) – Kshitij Yadav Aug 28 '18 at 20:43
  • Done! Thank you. Could you also help me in upvoting my answer? I am new here and had 2 bad questions asking by the community. It will help me. Thank you. – Kshitij Yadav Aug 28 '18 at 20:57
  • You already have my +1 (this I guess was counter-acted by someone else's downvote...) – sacuL Aug 28 '18 at 20:58
  • Hey could you see this question: https://stackoverflow.com/questions/52083723/count-re-occurrence-of-a-value-in-python-aggregated-with-respect-to-another-valu – Kshitij Yadav Aug 29 '18 at 18:08

Start by removing consecutive duplicates, then count the survivors:

no_dups = df[df.Cookie != df.Cookie.shift()] # Borrowed from @sacul
no_dups.groupby('Cookie').count() - 1
#        SNo
#Cookie     
#A         2
#B         1
#C         0
#D         2
#E         0
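
To get from here to the three-column layout the question asks for, something like the sketch below might work. It also prints a plain per-cookie row count for contrast, which counts every row, including consecutive repeats. The names `counts` and `naive` and the `Sno` numbering are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({"SNo": range(1, 13), "Cookie": list("AAABCDABDEDA")})

# Drop rows that repeat the previous row's Cookie.
no_dups = df[df["Cookie"] != df["Cookie"].shift()]

# One row per run remains, so runs per cookie minus one is the answer.
counts = no_dups.groupby("Cookie").size().sub(1).reset_index(name="Count")
counts.insert(0, "Sno", range(1, len(counts) + 1))
print(counts)

# For contrast: counting on the unfiltered frame includes consecutive repeats.
naive = df.groupby("Cookie").size()
print(naive)
```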
DYZ
  • DYZ, can code like this do the count here? `df.groupby('Cookie').size().reset_index(name='Count')` – Sai Kumar Aug 28 '18 at 21:22
  • Your code will not eliminate consecutive duplicates. – DYZ Aug 28 '18 at 21:23
  • DYZ, can you help me solve this: https://stackoverflow.com/questions/52083723/count-re-occurrence-of-a-value-in-python-aggregated-with-respect-to-another-valu – Kshitij Yadav Aug 29 '18 at 18:09

pandas.factorize and numpy.bincount

  1. If immediately repeated values are not counted then remove them.
  2. Do a normal counting of values on what's left.
  3. However, that is one more than what is asked for, so subtract one.

  1. factorize
  2. Filter out immediate repeats
  3. bincount
  4. Produce pandas.Series

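import numpy as np
import pandas as pd

i, r = pd.factorize(df.Cookie)            # integer codes, unique values
mask = np.append(True, i[:-1] != i[1:])   # True where a value differs from its predecessor
cnts = np.bincount(i[mask]) - 1           # runs per code, minus the first run

pd.Series(cnts, r)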

A    2
B    1
C    0
D    2
E    0
dtype: int64
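
To make the intermediate arrays visible, here is a small sketch of the same idea on the sample data. The variable names are illustrative, and `minlength` is an added safeguard that is not in the snippet above.

```python
import numpy as np
import pandas as pd

cookies = pd.Series(list("AAABCDABDEDA"))

# factorize maps each distinct value to an integer code (A=0, B=1, ...).
codes, uniques = pd.factorize(cookies)
# codes -> [0 0 0 1 2 3 0 1 3 4 3 0]

# True at position 0 and wherever a code differs from the one before it.
keep = np.append(True, codes[1:] != codes[:-1])
kept = codes[keep]
# kept -> [0 1 2 3 0 1 3 4 3 0]

# bincount counts how often each code occurs; subtract the first run.
counts = np.bincount(kept, minlength=len(uniques)) - 1
result = pd.Series(counts, index=uniques)
print(result)
```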

pandas.value_counts

Zip the cookies with a lagged copy of themselves, keeping only the values that differ from their predecessor

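c = df.Cookie.tolist()

# Series.value_counts is used here because the top-level pd.value_counts is deprecated in recent pandas versions
pd.Series([a for a, b in zip(c, [None] + c) if a != b]).value_counts().sort_index() - 1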

A    2
B    1
C    0
D    2
E    0
dtype: int64

defaultdict

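from collections import defaultdict

def count(s):
    d = defaultdict(lambda: -1)  # start at -1 so a value's first run is not counted
    x = None                     # previous value
    for y in s:
        d[y] += y != x           # adds 1 (True) whenever a new run of y starts
        x = y

    return pd.Series(d)

count(df.Cookie)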

A    2
B    1
C    0
D    2
E    0
dtype: int64
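
If pandas is not required, the same "collapse runs, count, subtract one" logic can be sketched with the standard library alone. This uses `itertools.groupby` and `collections.Counter`, which is a different technique from the three above; the function name `reoccurrences` is hypothetical.

```python
from collections import Counter
from itertools import groupby

def reoccurrences(values):
    """Return {value: number of runs of that value, minus one}."""
    # itertools.groupby yields one key per run of equal consecutive items.
    runs = Counter(key for key, _ in groupby(values))
    # The first run of each value is not a reoccurrence, so subtract it.
    return {key: n - 1 for key, n in runs.items()}

result = reoccurrences("AAABCDABDEDA")
print(result)
```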
piRSquared