
I have a dataframe which I am creating by reading an Excel file:

Project Release Name    Cycle Name  Cycle Start Date    Cycle End Date  Exec Date   Planned Exec Date   Available Test Cases    Planned Tested  Passed  Failed  Blocked No Run  Tester
B1  Y1  CM1 2/7/2018    2/20/2018   2/6/2018    2/6/2018    2   10  8   8   0   0   0   Tester3
B1  Y1  CM1 2/7/2018    2/20/2018   2/7/2018    2/7/2018    2   13  10  9   1   1   0   Tester3
B1  Y1  CM1 2/7/2018    2/20/2018   2/8/2018    2/8/2018    0   1   1   1   0   0   0   Tester3
B1  Y1  CM1 2/7/2018    2/20/2018   2/9/2018    2/9/2018    0   2   2   2   0   0   0   Tester3
B1  Y1  CM1 2/7/2018    2/20/2018   2/10/2018   2/10/2018   0   2   2   2   0   0   0   Tester3
B1  Y1  CL1 2/7/2018    2/25/2018   2/1/2018    2/1/2018    5   25  20  20  0   0   0   Tester 4
B1  Y1  CL1 2/7/2018    2/25/2018   2/2/2018    2/2/2018    10  30  20  18  2   0   0   Tester 4
B1  Y1  CL1 2/7/2018    2/25/2018   2/3/2018    2/3/2018    0   2   2   0   2   0   0   Tester 4
B1  Y1  CL1 1/17/2018   2/25/2018   2/4/2018    2/4/2018    0   3   3   1   2   0   0   Tester 4
B1  Y1  CL1 1/17/2018   2/25/2018   2/5/2018    2/5/2018    5   32  25  20  4   1   0   Tester 4
C1  Z1  CK1 1/10/2018   2/20/2018   2/3/2018    2/3/2018    0   1   1   0   1   0   0   Tester5
C1  Z1  CK1 1/10/2018   2/20/2018   2/4/2018    2/4/2018    0   1   1   0   1   0   0   Tester5
C1  Z1  CK1 1/10/2018   2/20/2018   2/5/2018    2/5/2018    0   1   1   0   1   0   0   Tester5
C1  Z1  CK1 1/10/2018   2/20/2018   2/6/2018    2/6/2018    0   1   1   1   0   0   0   Tester5
C1  Z1  CK1 1/10/2018   2/20/2018   2/7/2018    2/7/2018    0   1   1   1   0   0   0   Tester6
C1  Z1  CK1 1/10/2018   2/20/2018   2/8/2018    2/8/2018    0   1   1   1   0   0   0   Tester6
C1  Z1  CK2 1/17/2018   2/18/2018   2/6/2018    2/6/2018    0   1   1   1   0   0   0   Tester6
C1  Z1  CK2 1/17/2018   2/18/2018   2/7/2018    2/7/2018    0   2   2   0   2   0   0   Tester6
C1  Z1  CK2 1/17/2018   2/18/2018   2/8/2018    2/8/2018    0   2   2   0   2   0   0   Tester7
C1  Z1  CK2 1/17/2018   2/18/2018   2/9/2018    2/9/2018    0   2   2   0   2   0   0   Tester7
C1  Z1  CK2 1/17/2018   2/18/2018   2/10/2018   2/10/2018   0   2   2   1   1   0   0   Tester7
C1  Z1  CK2 1/17/2018   2/18/2018   2/11/2018   2/11/2018   0   2   2   2   0   0   0   Tester7

And I am using pandas groupby as follows:

dx1 = pd.read_excel('Trend.xlsx', sheet_name='Execution by Date')

dx1 = dx1.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date',
       'Cycle End Date'])[['Exec Date', 'Planned Exec Date', 'Available Test Cases',
       'Planned', 'Tested', 'Passed', 'Failed', 'Blocked']].sum().reset_index()

and here is the result I get:

Project Release Name    Cycle Name  Cycle Start Date    Cycle End Date  Available Test Cases    Planned Tested  Passed  Failed  Blocked
B1  Y1  CL1 2018-01-17 00:00:00 2018-02-25 00:00:00 5   35  28  21  6   1
B1  Y1  CL1 2018-02-07 00:00:00 2018-02-25 00:00:00 15  57  42  38  4   0
B1  Y1  CM1 2018-02-07 00:00:00 2018-02-20 00:00:00 4   28  23  22  1   1
C1  Z1  CK1 2018-01-10 00:00:00 2018-02-20 00:00:00 0   6   6   3   3   0
C1  Z1  CK2 2018-01-17 00:00:00 2018-02-18 00:00:00 0   11  11  4   7   0

As you can see, 'Exec Date' and 'Planned Exec Date' are missing.

How can I bring both missing date columns back into the dataframe? I have tried every seemingly relevant solution, but none worked for me.

novastar
  • Welcome to SO. Please provide a [mcve]. In addition, please read through: [How to make good reproducible pandas examples](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – jpp Feb 14 '18 at 17:41

1 Answer


You can't. You're grouping by

['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date', 'Cycle End Date']

for which each combination has multiple different values of Exec Date and Planned Exec Date. In other words, a single group may contain e.g. 3 different dates, and an aggregation can only keep one. groupby().sum() will not choose one of them for you; it simply leaves those columns out of the result. You can pick a value manually, however, and then merge it into your groupby() results:

import pandas as pd
pd.set_option("display.width", 300)
import sys

# Read in data set
if sys.version_info[0] < 3:
    from StringIO import StringIO
else:
    from io import StringIO

test_data = StringIO("""Project;Release Name;Cycle Name;Cycle Start Date;Cycle End Date;Exec Date;Planned Exec Date;Available Test Cases;Planned;Tested;Passed;Failed;Blocked;No Run;Tester
        B1;Y1;CM1;2/7/2018;2/20/2018;2/6/2018;2/6/2018;2;10;8;8;0;0;0;Tester3
        B1;Y1;CM1;2/7/2018;2/20/2018;2/7/2018;2/7/2018;2;13;10;9;1;1;0;Tester3
        B1;Y1;CM1;2/7/2018;2/20/2018;2/8/2018;2/8/2018;0;1;1;1;0;0;0;Tester3
        B1;Y1;CM1;2/7/2018;2/20/2018;2/9/2018;2/9/2018;0;2;2;2;0;0;0;Tester3
        B1;Y1;CM1;2/7/2018;2/20/2018;2/10/2018;2/10/2018;0;2;2;2;0;0;0;Tester3
        B1;Y1;CL1;2/7/2018;2/25/2018;2/1/2018;2/1/2018;5;25;20;20;0;0;0;Tester4
        B1;Y1;CL1;2/7/2018;2/25/2018;2/2/2018;2/2/2018;10;30;20;18;2;0;0;Tester4
        B1;Y1;CL1;2/7/2018;2/25/2018;2/3/2018;2/3/2018;0;2;2;0;2;0;0;Tester4
        B1;Y1;CL1;1/17/2018;2/25/2018;2/4/2018;2/4/2018;0;3;3;1;2;0;0;Tester4
        B1;Y1;CL1;1/17/2018;2/25/2018;2/5/2018;2/5/2018;5;32;25;20;4;1;0;Tester4
        C1;Z1;CK1;1/10/2018;2/20/2018;2/3/2018;2/3/2018;0;1;1;0;1;0;0;Tester5
        C1;Z1;CK1;1/10/2018;2/20/2018;2/4/2018;2/4/2018;0;1;1;0;1;0;0;Tester5
        C1;Z1;CK1;1/10/2018;2/20/2018;2/5/2018;2/5/2018;0;1;1;0;1;0;0;Tester5
        C1;Z1;CK1;1/10/2018;2/20/2018;2/6/2018;2/6/2018;0;1;1;1;0;0;0;Tester5
        C1;Z1;CK1;1/10/2018;2/20/2018;2/7/2018;2/7/2018;0;1;1;1;0;0;0;Tester6
        C1;Z1;CK1;1/10/2018;2/20/2018;2/8/2018;2/8/2018;0;1;1;1;0;0;0;Tester6
        C1;Z1;CK2;1/17/2018;2/18/2018;2/6/2018;2/6/2018;0;1;1;1;0;0;0;Tester6
        C1;Z1;CK2;1/17/2018;2/18/2018;2/7/2018;2/7/2018;0;2;2;0;2;0;0;Tester6
        C1;Z1;CK2;1/17/2018;2/18/2018;2/8/2018;2/8/2018;0;2;2;0;2;0;0;Tester7
        C1;Z1;CK2;1/17/2018;2/18/2018;2/9/2018;2/9/2018;0;2;2;0;2;0;0;Tester7
        C1;Z1;CK2;1/17/2018;2/18/2018;2/10/2018;2/10/2018;0;2;2;1;1;0;0;Tester7
        C1;Z1;CK2;1/17/2018;2/18/2018;2/11/2018;2/11/2018;0;2;2;2;0;0;0;Tester7""")

df = pd.read_csv(test_data, sep=";")
new_df = df.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date',
                     'Cycle End Date'])[['Exec Date', 'Planned Exec Date', 'Available Test Cases',
                                         'Planned', 'Tested', 'Passed', 'Failed', 'Blocked']].sum().reset_index()
print(new_df)

new_df looks like:

      Project Release Name Cycle Name Cycle Start Date Cycle End Date  Available Test Cases  Planned  Tested  Passed  Failed  Blocked
0          B1           Y1        CL1        1/17/2018      2/25/2018                     5       35      28      21       6        1
1          B1           Y1        CL1         2/7/2018      2/25/2018                    15       57      42      38       4        0
2          B1           Y1        CM1         2/7/2018      2/20/2018                     4       28      23      22       1        1
3          C1           Z1        CK1        1/10/2018      2/20/2018                     0        6       6       3       3        0
4          C1           Z1        CK2        1/17/2018      2/18/2018                     0       11      11       4       7        0
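
You can verify the ambiguity yourself. A minimal sketch (toy data standing in for the full frame): Series.nunique() per group shows how many distinct dates sum() would have had to choose between.

```python
import pandas as pd

# Toy version of the same situation: one group holds several distinct dates
df = pd.DataFrame({
    "Cycle Name": ["CM1", "CM1", "CM1", "CK1"],
    "Exec Date": ["2/6/2018", "2/7/2018", "2/8/2018", "2/3/2018"],
    "Planned": [10, 13, 1, 1],
})

# Count distinct Exec Date values per group; anything above 1 means
# there is no single value an aggregation could keep for that column
counts = df.groupby("Cycle Name")["Exec Date"].nunique()
print(counts)
```

Any group with a count above 1 is one where the date columns cannot survive a plain sum().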

You can then do the groupby() again, but keep only the first occurrence. Now the missing columns will be present, since there is no ambiguity left in them:

# Get first occurrence of "Exec Date" and "Planned Exec Date"
firsts = df.groupby(['Project', 'Release Name', 'Cycle Name', 'Cycle Start Date',
                     'Cycle End Date']).first().reset_index()
print(firsts)

firsts looks like:

      Project Release Name Cycle Name Cycle Start Date Cycle End Date Exec Date Planned Exec Date  Available Test Cases  Planned  Tested  Passed  Failed  Blocked  No Run   Tester
0          B1           Y1        CL1        1/17/2018      2/25/2018  2/4/2018          2/4/2018                     0        3       3       1       2        0       0  Tester4
1          B1           Y1        CL1         2/7/2018      2/25/2018  2/1/2018          2/1/2018                     5       25      20      20       0        0       0  Tester4
2          B1           Y1        CM1         2/7/2018      2/20/2018  2/6/2018          2/6/2018                     2       10       8       8       0        0       0  Tester3
3          C1           Z1        CK1        1/10/2018      2/20/2018  2/3/2018          2/3/2018                     0        1       1       0       1        0       0  Tester5
4          C1           Z1        CK2        1/17/2018      2/18/2018  2/6/2018          2/6/2018                     0        1       1       1       0        0       0  Tester6

Then merge the initial groupby() result (the one with the sums) with the groupby() results containing the missing columns:

# Merge in the missing columns into the result from the groupby
new_df_with_missing_columns = new_df.merge(firsts[["Project", "Release Name", "Cycle Name", "Cycle Start Date", "Cycle End Date", "Exec Date", "Planned Exec Date"]], on=["Project", "Release Name", "Cycle Name", "Cycle Start Date", "Cycle End Date"])
print new_df_with_missing_columns

new_df_with_missing_columns looks like:

      Project Release Name Cycle Name Cycle Start Date Cycle End Date  Available Test Cases  Planned  Tested  Passed  Failed  Blocked Exec Date Planned Exec Date
0          B1           Y1        CL1        1/17/2018      2/25/2018                     5       35      28      21       6        1  2/4/2018          2/4/2018
1          B1           Y1        CL1         2/7/2018      2/25/2018                    15       57      42      38       4        0  2/1/2018          2/1/2018
2          B1           Y1        CM1         2/7/2018      2/20/2018                     4       28      23      22       1        1  2/6/2018          2/6/2018
3          C1           Z1        CK1        1/10/2018      2/20/2018                     0        6       6       3       3        0  2/3/2018          2/3/2018
4          C1           Z1        CK2        1/17/2018      2/18/2018                     0       11      11       4       7        0  2/6/2018          2/6/2018
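
As an alternative sketch of the same idea in one pass: groupby().agg() accepts a per-column mapping, so the sums and the first date per group can be computed together without a separate merge, assuming "first" is an acceptable choice for the date columns. The toy frame below uses a subset of the question's columns for brevity.

```python
import pandas as pd

# Toy subset of the data; column names follow the question
df = pd.DataFrame({
    "Project": ["B1", "B1", "C1"],
    "Cycle Name": ["CM1", "CM1", "CK1"],
    "Exec Date": ["2/6/2018", "2/7/2018", "2/3/2018"],
    "Planned": [10, 13, 1],
    "Tested": [8, 10, 1],
})

# Sum the numeric columns and keep the first date per group in one call
result = df.groupby(["Project", "Cycle Name"]).agg({
    "Exec Date": "first",
    "Planned": "sum",
    "Tested": "sum",
}).reset_index()
print(result)
```

This avoids the second groupby() and the merge entirely; extend the dict with the remaining numeric columns to cover the full frame.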
Plasma
  • Thanks for your response and I am trying to find if we can do something like in this post: https://stackoverflow.com/questions/11391969/how-to-group-pandas-dataframe-entries-by-date-in-a-non-unique-column The difference is that this post addresses the year component of date and there are fewer columns. – novastar Feb 16 '18 at 07:55