2

I have a dataframe in Pandas with collected data;

import pandas as pd
df = pd.DataFrame({'Group': ['A','A','A','A','A','A','A','B','B','B','B','B','B','B'], 'Subgroup': ['Blue', 'Blue','Blue','Red','Red','Red','Red','Blue','Blue','Blue','Blue','Red','Red','Red'],'Obs':[1,2,4,1,2,3,4,1,2,3,6,1,2,3]})

+-------+----------+-----+
| Group | Subgroup | Obs |
+-------+----------+-----+
| A     | Blue     |   1 |
| A     | Blue     |   2 |
| A     | Blue     |   4 |
| A     | Red      |   1 |
| A     | Red      |   2 |
| A     | Red      |   3 |
| A     | Red      |   4 |
| B     | Blue     |   1 |
| B     | Blue     |   2 |
| B     | Blue     |   3 |
| B     | Blue     |   6 |
| B     | Red      |   1 |
| B     | Red      |   2 |
| B     | Red      |   3 |
+-------+----------+-----+

The Observations ('Obs') are supposed to be numbered without gaps, but you can see we have 'missed' Blue 3 in group A and Blue 4 and 5 in group B. The desired outcome is a percentage of all 'missed' Observations ('Obs') per group, so in the example:

+-------+--------------------+--------+--------+
| Group | Total Observations | Missed |   %    |
+-------+--------------------+--------+--------+
| A     |                  8 |      1 | 12.5%  |
| B     |                  9 |      2 | 22.22% |
+-------+--------------------+--------+--------+

I tried both with for loops and by using groups (for example:

df.groupby(['Group','Subgroup']).sum()
print(groups.head)

) but I can't seem to get that to work in any way I try. Am I going about this the wrong way?

From another answer (big shoutout to @Lie Ryan) I found a function to look for missing elements, however I don't quite understand how to implement this yet;

def window(seq, n=2):
    "Returns a sliding window (of width n) over data from the iterable"
    "   s -> (s0,s1,...s[n-1]), (s1,s2,...,sn), ...                   "
    it = iter(seq)
    result = tuple(islice(it, n))
    if len(result) == n:
        yield result
    for elem in it:
        result = result[1:] + (elem,)
        yield result

def missing_elements(L):
    missing = chain.from_iterable(range(x + 1, y) for x, y in window(L) if (y - x) > 1)
    return list(missing)

Can anyone give me a pointer is the right direction?

Jean-François Corbett
  • 37,420
  • 30
  • 139
  • 188
Jaap Baanders
  • 55
  • 1
  • 5

3 Answers3

4

Simple enough, you'll need groupby here:

  1. Using groupby + diff, figure out how many observations are missing per Group and SubGroup
  2. Group df on Group, and compute the size and sum of the column computed in the previous step
  3. A couple more straightforward steps (calculating the %) give you your intended output.

f = [   # declare an aggfunc list in advance, we'll need it later
      ('Total Observations', 'size'), 
      ('Missed', 'sum')
]

g = df.groupby(['Group', 'Subgroup'])\
      .Obs.diff()\
      .sub(1)\
      .groupby(df.Group)\
      .agg(f)

g['Total Observations'] += g['Missed']
g['%'] = g['Missed'] / g['Total Observations'] * 100 

g

       Total Observations  Missed          %
Group                                       
A                     8.0     1.0  12.500000
B                     9.0     2.0  22.222222
cs95
  • 379,657
  • 97
  • 704
  • 746
2

A similar approach using groupby, apply and assign:

(
    df.groupby(['Group','Subgroup']).Obs
    .apply(lambda x: [x.max()-x.min()+1, x.max()-x.min()+1-len(x)])
    .apply(pd.Series)
    .groupby(level=0).sum()
    .assign(pct=lambda x: x[1]/x[0]*100)
    .set_axis(['Total Observations', 'Missed', '%'], axis=1, inplace=False)
)

Out[75]: 
       Total Observations  Missed          %
Group                                       
A                       8       1  12.500000
B                       9       2  22.222222
Allen Qin
  • 19,507
  • 8
  • 51
  • 67
2
from collections import Counter

gs = ['Group', 'Subgroup']
old_tups = set(zip(*df.values.T))

missed = pd.Series(Counter(
    g for (g, s), d in df.groupby(gs)
    for o in range(d.Obs.min(), d.Obs.max() + 1)
    if (g, s, o) not in old_tups
), name='Missed')

hit = df.set_index(gs).Obs.count(level=0)
total = hit.add(missed).rename('Total')
ratio = missed.div(total).rename('%')

pd.concat([total, missed, ratio], axis=1).reset_index()

  Group  Total  Missed         %
0     A      8       1  0.125000
1     B      9       2  0.222222
piRSquared
  • 285,575
  • 57
  • 475
  • 624