2

I have a csv file with dates, repair_id, number of onsite repairs and number of offsite repairs, so that my data looks as:

data        repair_id    num_onsite     num_offsite
2016-02-01          A             3              0
2016-02-01          B             2              1
2016-02-01          D             0              4
2016-02-02          A             1              3
2016-02-02          C             1              1
2016-02-02          E             0              6
...
2016-02-14          A             1              3
2016-02-14          B             0              4
2016-02-14          D             2              0
2016-02-14          E             3              0

There are 5 different repair_id, namely: A, B, C, D, E. If a repair man (repair_id) had no work on a given date then they are not in the csv file for that date. I would like to change that by including them and have a 0 value for num_onsite and num_offsite so that my table would resemble:

data        repair_id    num_onsite     num_offsite
2016-02-01          A             3              0
2016-02-01          B             2              1
2016-02-01          C             0              0 # added
2016-02-01          D             0              4
2016-02-01          E             0              0 # added
2016-02-02          A             1              3
2016-02-02          B             0              0 # added
2016-02-02          C             1              1
2016-02-02          D             0              0 # added
2016-02-02          E             0              6
...
2016-02-14          A             1              3
2016-02-14          B             0              4
2016-02-14          C             0              0 # added
2016-02-14          D             2              0
2016-02-14          E             3              0

I've had a look at:

Pandas DataFrame insert / fill missing rows from previous dates

Missing data, insert rows in Pandas and fill with NAN

Add missing dates to pandas dataframe

but I wasn't able to get it to output properly

Community
  • 1
  • 1
Lukasz
  • 2,476
  • 10
  • 41
  • 51

3 Answers3

4
df.set_index(["data","repair_id"]).unstack(fill_value=0).stack().reset_index()      



        data repair_id  num_onsite  num_offsite
0 2016-02-01         A         3.0          0.0
1 2016-02-01         B         2.0          1.0
2 2016-02-01         C         0.0          0.0
3 2016-02-01         D         0.0          4.0
4 2016-02-01         E         0.0          0.0
5 2016-02-02         A         1.0          3.0
6 2016-02-02         B         0.0          0.0
7 2016-02-02         C         1.0          1.0
8 2016-02-02         D         0.0          0.0
9 2016-02-02         E         0.0          6.0
dataflow
  • 475
  • 2
  • 12
  • 1
    Great answer, improve it with `fill_value=0`. Like this: `df.set_index(["data","repair_id"]).unstack(fill_value=0).stack().reset_index() ` – piRSquared Mar 01 '17 at 04:50
3

Set the index, reindex with fill_value, then reset_index

mux = pd.MultiIndex.from_product(
    [df.data.unique(), df.repair_id.unique()],
    names=['data', 'repair_id']
)

df.set_index(['data', 'repair_id']).reindex(mux, fill_value=0).reset_index()

          data repair_id  num_onsite  num_offsite
0   2016-02-01         A           3            0
1   2016-02-01         B           2            1
2   2016-02-01         D           0            4
3   2016-02-01         C           0            0
4   2016-02-01         E           0            0
5   2016-02-02         A           1            3
6   2016-02-02         B           0            0
7   2016-02-02         D           0            0
8   2016-02-02         C           1            1
9   2016-02-02         E           0            6
10  2016-02-14         A           1            3
11  2016-02-14         B           0            4
12  2016-02-14         D           2            0
13  2016-02-14         C           0            0
14  2016-02-14         E           3            0
piRSquared
  • 285,575
  • 57
  • 475
  • 624
2

And for those of us with the SQL mentality, consider merging (left join) on a set of all possible date and id combinations:

import itertools
...
combns = pd.DataFrame(list(itertools.product(df['data'].unique(), df['repair_id'].unique())),
                      columns=['data', 'repair_id'])

new_df = combns.merge(df, on=['data', 'repair_id'], how='left')\
               .fillna(0).sort_values(['data', 'repair_id']).reset_index(drop=True)

#           data repair_id  num_onsite  num_offsite
# 0   2016-02-01         A         3.0          0.0
# 1   2016-02-01         B         2.0          1.0
# 2   2016-02-01         C         0.0          0.0
# 3   2016-02-01         D         0.0          4.0
# 4   2016-02-01         E         0.0          0.0
# 5   2016-02-02         A         1.0          3.0
# 6   2016-02-02         B         0.0          0.0
# 7   2016-02-02         C         1.0          1.0
# 8   2016-02-02         D         0.0          0.0
# 9   2016-02-02         E         0.0          6.0
# 10  2016-02-14         A         1.0          3.0
# 11  2016-02-14         B         0.0          4.0
# 12  2016-02-14         C         0.0          0.0
# 13  2016-02-14         D         2.0          0.0
# 14  2016-02-14         E         3.0          0.0
Parfait
  • 104,375
  • 17
  • 94
  • 125