
I have a dataframe df:

import pandas as pd
s = {'id': [243,243, 243, 243, 443,443,443, 332,334,332,332, 333],
 'col':[1,1,1,1,1,1,1,2,2,2,2,2],
 'st': [1,3,5,9,12, 18,23, 1,2,4,8,14],
 'value':[2.4, 3.8, 3.7, 5.6, 1.2, 0.2, 2.1, 2.0, 2.5, 3.4, 1.2, 2.4]}
df = pd.DataFrame(s)

It looks like:

     id  col  st  value
0   243    1   1    2.4
1   243    1   3    3.8
2   243    1   5    3.7
3   243    1   9    5.6
4   443    1  12    1.2
5   443    1  18    0.2
6   443    1  23    2.1
7   332    2   1    2.0
8   334    2   2    2.5
9   332    2   4    3.4
10  332    2   8    1.2
11  333    2  14    2.4

The data has two groups, col 1 and 2 (the real data has many groups). I want to insert the missing records based on the st column, with value kept as 0 for the inserted rows.

My output must look like

id  col  st  value
243    1   1    2.4
0      1   2     0
243    1   3    3.8
0      1   4     0
243    1   5    3.7

and so on

332    2   1    2.0
334    2   2    2.5
0      2   3     0
332    2   4    3.4
0      2   5     0
0      2   6     0
0      2   7     0
332    2   8    1.2

How can I do this in pandas?

Cohan
Archit

3 Answers


Use DataFrame.reindex per group with GroupBy.apply and range:

df = (df.set_index('st')
        .groupby('col')[['id', 'value']]
        .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1), fill_value=0))
        .reset_index()
      )

print(df)
    col  st   id  value
0     1   1  243    2.4
1     1   2    0    0.0
2     1   3  243    3.8
3     1   4    0    0.0
4     1   5  243    3.7
5     1   6    0    0.0
6     1   7    0    0.0
7     1   8    0    0.0
8     1   9  243    5.6
9     1  10    0    0.0
10    1  11    0    0.0
11    1  12  443    1.2
12    1  13    0    0.0
13    1  14    0    0.0
14    1  15    0    0.0
15    1  16    0    0.0
16    1  17    0    0.0
17    1  18  443    0.2
18    1  19    0    0.0
19    1  20    0    0.0
20    1  21    0    0.0
21    1  22    0    0.0
22    1  23  443    2.1
23    2   1  332    2.0
24    2   2  334    2.5
25    2   3    0    0.0
26    2   4  332    3.4
27    2   5    0    0.0
28    2   6    0    0.0
29    2   7    0    0.0
30    2   8  332    1.2
31    2   9    0    0.0
32    2  10    0    0.0
33    2  11    0    0.0
34    2  12    0    0.0
35    2  13    0    0.0
36    2  14  333    2.4
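If you want the question's id, col, st, value column order back (the apply above moves col and st to the front), a reorder at the end is enough. A self-contained sketch on the sample data:

```python
import pandas as pd

s = {'id': [243, 243, 243, 243, 443, 443, 443, 332, 334, 332, 332, 333],
     'col': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'st': [1, 3, 5, 9, 12, 18, 23, 1, 2, 4, 8, 14],
     'value': [2.4, 3.8, 3.7, 5.6, 1.2, 0.2, 2.1, 2.0, 2.5, 3.4, 1.2, 2.4]}
df = pd.DataFrame(s)

out = (df.set_index('st')
         .groupby('col')[['id', 'value']]
         .apply(lambda x: x.reindex(range(x.index.min(), x.index.max() + 1),
                                    fill_value=0))
         .reset_index())

# restore the column order shown in the question
out = out[['id', 'col', 'st', 'value']]
```

Because fill_value=0 is an integer, the filled id column stays integer rather than being upcast to float.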
jezrael

Method using unnesting: first create the range with groupby + agg, then explode it and merge.

import numpy as np

def unnesting(df, explode):
    # repeat each row once per element of its list column, then flatten
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(explode, axis=1), how='left')

s = df.groupby(['id', 'col']).st.agg(['min', 'max'])
s['st'] = [list(range(x, y + 1)) for x, y in zip(s['min'], s['max'])]
newdf = (unnesting(s.drop(['min', 'max'], axis=1).reset_index(), ['st'])
           .merge(df, how='left')
           .fillna(0))
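On pandas >= 0.25, the custom unnesting helper can be replaced by the built-in DataFrame.explode. A sketch of the same min-to-max-per-(id, col) idea on the sample data:

```python
import pandas as pd

s = {'id': [243, 243, 243, 243, 443, 443, 443, 332, 334, 332, 332, 333],
     'col': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'st': [1, 3, 5, 9, 12, 18, 23, 1, 2, 4, 8, 14],
     'value': [2.4, 3.8, 3.7, 5.6, 1.2, 0.2, 2.1, 2.0, 2.5, 3.4, 1.2, 2.4]}
df = pd.DataFrame(s)

rng = df.groupby(['id', 'col']).st.agg(['min', 'max'])
rng['st'] = [list(range(a, b + 1)) for a, b in zip(rng['min'], rng['max'])]
newdf = (rng.drop(columns=['min', 'max'])
            .reset_index()
            .explode('st')          # one row per st in each (id, col) range
            .astype({'st': 'int64'})  # explode leaves an object column
            .merge(df, how='left')  # joins on the shared id, col, st columns
            .fillna(0))
```

Note that, like the snippet above, this builds one range per (id, col) pair rather than per col, so the two variants fill in different sets of rows.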
BENY

Make a dataframe then merge.

You can take advantage of the fact that merge automatically merges on commonly named columns. So our goal will be to create a dataframe with just the columns we need and the values we want.

  • Columns we need: 'col' and 'st'
  • Values we need: each unique value in 'col' and all integer values from the minimum to the maximum in its respective group. Mind that I just grabbed the maximum and assumed the minimum was 1
  • Merge: I made sure to use a 'left' merge to preserve all the pretty values I just created.

m = pd.DataFrame(
    [(i, j) for i, J in df.groupby('col').st.max().items()
     for j in range(1, J + 1)],
    columns=['col', 'st']
)

m.merge(df, 'left').fillna(0)

Output

    col  st     id  value
0     1   1  243.0    2.4
1     1   2    0.0    0.0
2     1   3  243.0    3.8
3     1   4    0.0    0.0
4     1   5  243.0    3.7
5     1   6    0.0    0.0
6     1   7    0.0    0.0
7     1   8    0.0    0.0
8     1   9  243.0    5.6
9     1  10    0.0    0.0
10    1  11    0.0    0.0
11    1  12  443.0    1.2
12    1  13    0.0    0.0
13    1  14    0.0    0.0
14    1  15    0.0    0.0
           ...
22    1  23  443.0    2.1
23    2   1  332.0    2.0
24    2   2  334.0    2.5
25    2   3    0.0    0.0
26    2   4  332.0    3.4
27    2   5    0.0    0.0
28    2   6    0.0    0.0
29    2   7    0.0    0.0
30    2   8  332.0    1.2
31    2   9    0.0    0.0
32    2  10    0.0    0.0
33    2  11    0.0    0.0
34    2  12    0.0    0.0
35    2  13    0.0    0.0
36    2  14  333.0    2.4

We can be a bit more careful: use agg to get both min and max, and preserve the integer dtypes by passing downcast='infer' to fillna.

m = pd.DataFrame(
    [(i, j) for i, Mn, Mx in df.groupby('col').st.agg(['min', 'max']).itertuples()
     for j in range(Mn, Mx + 1)],
    columns=['col', 'st']
)

m.merge(df, 'left').fillna(0, downcast='infer')

Output

    col  st   id  value
0     1   1  243    2.4
1     1   2    0    0.0
2     1   3  243    3.8
3     1   4    0    0.0
4     1   5  243    3.7
5     1   6    0    0.0
6     1   7    0    0.0
7     1   8    0    0.0
8     1   9  243    5.6
9     1  10    0    0.0
10    1  11    0    0.0
11    1  12  443    1.2
12    1  13    0    0.0
13    1  14    0    0.0
14    1  15    0    0.0
           ...
22    1  23  443    2.1
23    2   1  332    2.0
24    2   2  334    2.5
25    2   3    0    0.0
26    2   4  332    3.4
27    2   5    0    0.0
28    2   6    0    0.0
29    2   7    0    0.0
30    2   8  332    1.2
31    2   9    0    0.0
32    2  10    0    0.0
33    2  11    0    0.0
34    2  12    0    0.0
35    2  13    0    0.0
36    2  14  333    2.4
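If you want to double-check which rows the merge created, merge's indicator argument marks them. A sketch reusing the same frame m on the sample data (with an explicit astype instead of downcast, which newer pandas deprecates):

```python
import pandas as pd

s = {'id': [243, 243, 243, 243, 443, 443, 443, 332, 334, 332, 332, 333],
     'col': [1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2],
     'st': [1, 3, 5, 9, 12, 18, 23, 1, 2, 4, 8, 14],
     'value': [2.4, 3.8, 3.7, 5.6, 1.2, 0.2, 2.1, 2.0, 2.5, 3.4, 1.2, 2.4]}
df = pd.DataFrame(s)

m = pd.DataFrame(
    [(i, j) for i, Mn, Mx in df.groupby('col').st.agg(['min', 'max']).itertuples()
     for j in range(Mn, Mx + 1)],
    columns=['col', 'st'])

out = m.merge(df, 'left', indicator=True)
created = out['_merge'].eq('left_only')  # True for the filled-in rows
out = out.drop(columns='_merge').fillna(0)
out['id'] = out['id'].astype(int)        # undo the NaN-induced float upcast
```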
piRSquared