I would like to compute the co-occurrence percentages after grouping. I am unable to determine the best method for doing so. I can think of ways to brute-force the answers, but this means lots of hard-coded calculations that may break as more source data is added. There must be a more elegant method, but I don't see it. I appreciate any suggestions.
(perhaps a little similar to Python Pandas check if a value occurs more then once in the same day)
Goal: Table of co-occurrence percentages for a data column after grouping. For example: When A occurred, B was found with A 45% of the time in January. When A occurred, C was found with A 21% of the time for week 6.
Sample Data (df):
Date ID Region Event
1/01/2016 1001 S C
1/01/2016 1001 S D
1/01/2016 1001 N E
1/01/2016 1002 E D
1/02/2016 1003 E A
1/04/2016 1005 N B
1/04/2016 1005 N B
1/04/2016 1005 N B
1/04/2016 1006 N A
1/04/2016 1006 N F
2/12/2016 1008 E C
2/12/2016 1008 E B
To calculate the percentages, I need to find Events that happen in with the same ID. So, for the whole dataset C when B is 50%, B isolated is 50% and all others are 0%. But, if i groupby Month, then B isolated is 100% for Jan, and C when B is 100% for Feb.
Currently, I have code using .isin and .drop_duplicates to find and reduce the lists:
b_ids = df[df.Event == 'B'].ID.drop_duplicates()
x = len(b_ids)
c_when_b = df[(df.ID.isin(b_ids)) & (df.Event == 'C')].ID.drop_duplicates()
y = len(c_when_b)
pct_cb = float(x)/y
Problems:
- How can this be extended to all binary combinations of Events (the real data has 25 events)
- How do I modify this for easy grouping by date (week, month, quarter, etc.)?
- How can the Region also be a grouping?
- How can it easily be extended to multiple criteria ( (A | B) & (C | D) )?
- Is there something easy that I'm completely missing? Please let me know if this is unclear. Thanks in advance.
EDIT: Expected output would be a multiple column series for each event for a given time grouping for plotting (ignore these actual numbers):
EVENT A
A B C ...
1 96.19 1.23 2.22
2 96.23 1.56 1.12
3 95.24 2.58 3.02
4 78.98 20.31 1.11
... .... ... ...
EVENT B
A B C ...
1 96.19 1.23 3.33
2 96.23 1.56 1.08
3 95.24 2.58 1.78
4 78.98 20.31 5.12
... .... ... ...