1

Scenario: I have a dataframe with multiple columns. Four of them (A_Id, FF_Id, Date and Output) are relevant.

Objective: I am trying to extract these columns and create a dataframe with the format Index=A_Id, Columns=F_Id, and the values come from the column Output. Whenever there are two or more values of a given combination of A and F, use the column Date to select the latest available.

Data Sample: Each given A_Id entry has a value in combination with each F_Id entry. a given combination can have 2 or more results with different dates . There are 10 unique F_Id, 7000 A_Id. Here is a subset

FId     Date        Output  AId
628     2020/12/31  Yes     1
629     2020/12/31  No      1
080     2020/12/31  No      1
081     2020/12/31  No      1
628     2020/12/31  Yes     2
629     2020/12/31  No      2
080     2020/12/31  No      2
081     2020/12/31  No      2
628     2021/12/31  Yes     3
629     2021/12/31  Yes     3
080     2021/12/31  No      3
081     2021/12/31  No      3
628     2020/12/31  Yes     14
629     2020/12/31  No      14
080     2020/12/31  No      14
081     2020/12/31  No      14
628     2021/12/31  Yes     14
629     2021/12/31  No      14
080     2021/12/31  No      14
081     2021/12/31  No      14
628     2020/12/31  Yes     15
629     2020/12/31  No      15
080     2020/12/31  No      15
081     2020/12/31  Yes     15

Desired ouptut: A Matrix using A_ID as the index and F_Id as columns. Either in one dataframe with 2d, where the first dimension has Output and the Second has the Dates:

    4628    4629    5080    5081    4628        4629        5080        5081
1   Yes     No      No      No      2020/12/31  2020/12/31  2020/12/31  2020/12/31
2   Yes     No      No      No      2020/12/31  2020/12/31  2020/12/31  2020/12/31
3   Yes     Yes     No      No      2021/12/31  2021/12/31  2021/12/31  2021/12/31
14  Yes     No      No      No      2021/12/31  2021/12/31  2021/12/31  2021/12/31
15  Yes     No      No      Yes     2020/12/31  2020/12/31  2020/12/31  2020/12/31

or two different dataframes, of the same size, one with Output and the other with Dates:

    4628    4629    5080    5081
1   Yes     No      No      No
2   Yes     No      No      No
3   Yes     Yes     No      No
14  Yes     No      No      No
15  Yes     No      No      Yes

    4628        4629        5080        5081
1   2020/12/31  2020/12/31  2020/12/31  2020/12/31
2   2020/12/31  2020/12/31  2020/12/31  2020/12/31
3   2021/12/31  2021/12/31  2021/12/31  2021/12/31
14  2021/12/31  2021/12/31  2021/12/31  2021/12/31
15  2020/12/31  2020/12/31  2020/12/31  2020/12/31

Obs: There are cases like A_Id 14, where the Outputs are available for two dates. In this case, I am trying to select the Output value for the latest available date, and that date for reference.

What I tried 1: First I tried passing the Output and Date columns into the pivot_table function.:

# source df is read from the raw data on an excel spreadsheet
list_aid =[source['A_Id']]
list_fid = [source['F_Id']]
list_answer = [source['Output']]

testdf33 = source[['AgentId','FactorId','Answer','Date']].pivot_table(values=['Answer','Date'],
                                                               index='AgentId', 
                                                               columns='FactorId',
                                                               aggfunc={'Answer': lambda x: ' '.join(x), 'Date': lambda x: ' '.join(x)})

Issue 1: When running this, there is the error, because I am passing a Datestime whereas the function was expecting a string.

TypeError: sequence item 0: expected str instance, Timestamp found

What I tried 2: I forcibly changed the data in the Datecolumn into str format:

testdf11 = source[['AgentId','FactorId','Answer','Date']]
testdf11['Date'] = testdf11['Date'].astype(str)


testdf33 = testdf11.pivot_table(values=['Answer','Date'],
                                index='AgentId',columns='FactorId',
                                aggfunc={'Answer': lambda x: ' '.join(x),
                                         'Date': lambda x: ' '.join(x)})

Obs2: This gives me all the data aggregated, but I only need the latest output/dates, not an aggregation of whatever is available.

Question: How can I output only the Output/Date combination for the latest available date?

DGMS89
  • 1,507
  • 6
  • 29
  • 60

2 Answers2

2

You can sort your dataframe by Date first then use pivot_table with first as aggregate function:

>>> (df.sort_values('Date', ascending=True)
       .pivot_table(index='AId', columns='FId',
                    values=['Date', 'Output'],
                    aggfunc='last'))

           Date                                     Output               
FId         80          81          628         629    80   81   628  629
AId                                                                      
1    2020/12/31  2020/12/31  2020/12/31  2020/12/31     No   No  Yes   No
2    2020/12/31  2020/12/31  2020/12/31  2020/12/31     No   No  Yes   No
3    2021/12/31  2021/12/31  2021/12/31  2021/12/31     No   No  Yes  Yes
14   2021/12/31  2021/12/31  2021/12/31  2021/12/31     No   No  Yes   No
15   2020/12/31  2020/12/31  2020/12/31  2020/12/31     No  Yes  Yes   No

Difference between first and last function:

df_last = (df.sort_values('Date', ascending=True)
             .pivot_table(index='AId', columns='FId',
                          values=['Date', 'Output'],
                          aggfunc='last'))

df_first = (df.sort_values('Date', ascending=True)
              .pivot_table(index='AId', columns='FId',
                           values=['Date', 'Output'],
                           aggfunc='first'))
>>> df_last.compare(df_first)

           Date                                                                                    
FId         80                      81                      628                     629            
           self       other        self       other        self       other        self       other
AId                                                                                                
14   2021/12/31  2020/12/31  2021/12/31  2020/12/31  2021/12/31  2020/12/31  2021/12/31  2020/12/31
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • Is this also applicable when there are more than 2 date values? – DGMS89 Jun 06 '23 at 13:12
  • 1
    It's exactly the case here. When you want to keep one value for one cell and you have multiple values, `first` or `last` takes the first or the last value available. That's why I sort the dataframe before pivoting. I sort the dates in ascending order to keep the last value. – Corralien Jun 06 '23 at 13:13
  • 1
    More information about pivot: [How can I pivot a dataframe?](https://stackoverflow.com/q/47152691/15239951) – Corralien Jun 06 '23 at 13:16
  • A follow up question: when this is performed, I lose the A_Id columns name, meaning I cannot do a .loc on it. Is there a way to solve this? – DGMS89 Jun 06 '23 at 13:24
  • I updated my answer according your request. – Corralien Jun 06 '23 at 13:30
  • Thanks for that, but no need to update the question, as this las question was out of scope for the original question. I was just wandering how a .loc based on a AId/FId would work in this output DF. – DGMS89 Jun 06 '23 at 13:34
  • 1
    My answer is also wrong!!! If you want to use `.loc` you don't need the column name. Just use something like `df_piv.loc[2, ('Date', 80)]` to extract the Date of AId=2 and FId=80 – Corralien Jun 06 '23 at 13:39
0
df.pivot_table(index='AId', columns='FId',
               values=['Date', 'Output'],
               aggfunc='last').sort_values('Date', ascending=True)
Mamed
  • 1,102
  • 8
  • 23