Scenario: I have a dataframe with multiple columns. Four of them (A_Id, FF_Id, Date and Output) are relevant.
Objective: I am trying to extract these columns and create a dataframe with the format Index=A_Id, Columns=F_Id, and the values come from the column Output. Whenever there are two or more values of a given combination of A and F, use the column Date to select the latest available.
Data Sample: Each given A_Id entry has a value in combination with each F_Id entry. a given combination can have 2 or more results with different dates . There are 10 unique F_Id, 7000 A_Id. Here is a subset
FId Date Output AId
628 2020/12/31 Yes 1
629 2020/12/31 No 1
080 2020/12/31 No 1
081 2020/12/31 No 1
628 2020/12/31 Yes 2
629 2020/12/31 No 2
080 2020/12/31 No 2
081 2020/12/31 No 2
628 2021/12/31 Yes 3
629 2021/12/31 Yes 3
080 2021/12/31 No 3
081 2021/12/31 No 3
628 2020/12/31 Yes 14
629 2020/12/31 No 14
080 2020/12/31 No 14
081 2020/12/31 No 14
628 2021/12/31 Yes 14
629 2021/12/31 No 14
080 2021/12/31 No 14
081 2021/12/31 No 14
628 2020/12/31 Yes 15
629 2020/12/31 No 15
080 2020/12/31 No 15
081 2020/12/31 Yes 15
Desired ouptut: A Matrix using A_ID as the index and F_Id as columns. Either in one dataframe with 2d, where the first dimension has Output and the Second has the Dates:
4628 4629 5080 5081 4628 4629 5080 5081
1 Yes No No No 2020/12/31 2020/12/31 2020/12/31 2020/12/31
2 Yes No No No 2020/12/31 2020/12/31 2020/12/31 2020/12/31
3 Yes Yes No No 2021/12/31 2021/12/31 2021/12/31 2021/12/31
14 Yes No No No 2021/12/31 2021/12/31 2021/12/31 2021/12/31
15 Yes No No Yes 2020/12/31 2020/12/31 2020/12/31 2020/12/31
or two different dataframes, of the same size, one with Output and the other with Dates:
4628 4629 5080 5081
1 Yes No No No
2 Yes No No No
3 Yes Yes No No
14 Yes No No No
15 Yes No No Yes
4628 4629 5080 5081
1 2020/12/31 2020/12/31 2020/12/31 2020/12/31
2 2020/12/31 2020/12/31 2020/12/31 2020/12/31
3 2021/12/31 2021/12/31 2021/12/31 2021/12/31
14 2021/12/31 2021/12/31 2021/12/31 2021/12/31
15 2020/12/31 2020/12/31 2020/12/31 2020/12/31
Obs: There are cases like A_Id 14, where the Outputs are available for two dates. In this case, I am trying to select the Output value for the latest available date, and that date for reference.
What I tried 1: First I tried passing the Output and Date columns into the pivot_table function.:
# source df is read from the raw data on an excel spreadsheet
list_aid =[source['A_Id']]
list_fid = [source['F_Id']]
list_answer = [source['Output']]
testdf33 = source[['AgentId','FactorId','Answer','Date']].pivot_table(values=['Answer','Date'],
index='AgentId',
columns='FactorId',
aggfunc={'Answer': lambda x: ' '.join(x), 'Date': lambda x: ' '.join(x)})
Issue 1: When running this, there is the error, because I am passing a Datestime whereas the function was expecting a string.
TypeError: sequence item 0: expected str instance, Timestamp found
What I tried 2: I forcibly changed the data in the Datecolumn into str format:
testdf11 = source[['AgentId','FactorId','Answer','Date']]
testdf11['Date'] = testdf11['Date'].astype(str)
testdf33 = testdf11.pivot_table(values=['Answer','Date'],
index='AgentId',columns='FactorId',
aggfunc={'Answer': lambda x: ' '.join(x),
'Date': lambda x: ' '.join(x)})
Obs2: This gives me all the data aggregated, but I only need the latest output/dates, not an aggregation of whatever is available.
Question: How can I output only the Output/Date combination for the latest available date?