I have a CSV file that contains users and their questions and answers for prescreening questions on a job requisition. There are cases where a given question can have multiple answers. Below is how the CSV currently looks:
User,RequisitionID,Question,Answer
user1,190004116,6162,7296
user2,190004086,6115,7260
user2,190004086,6117,7264
user2,190004086,6117,7265
user2,190004086,6117,7268
user2,190004086,6117,7269
user3,190005321,6321,4221
user3,190005321,6321,4322
Instead of the above format, I would like to have Answer data represented in separate columns (with as many columns as there are distinct answers per Question) grouped by User, RequisitionID and Question, like so:
User,RequisitionID,Question,Answer1,Answer2,Answer3,Answer4
user1,190004116,6162,7296,,,
user2,190004086,6115,7260,,,
user2,190004086,6117,7264,7265,7268,7269
user3,190005321,6321,4221,4321,
I've tried using "groupby" as below, but I'm just not able to get the resulting Dataframe in the shape I want...
reqPrscrAnsFileFiltered = reqPrscrAnsFileFiltered.groupby(['User','RequisitionID','Question']).Answer.apply(list)
reqPrscrAnsFileFiltered = pandas.DataFrame(reqPrscrAnsFileFiltered.tolist(), index=reqPrscrAnsFileFiltered.index)
Sorry, but I'm new to Python and Pandas, so any help would be great.
Well - we're getting there. I implemented the pivot_table code snippet you provided in your comment but I'm afraid I'm not sure what you mean about collapsing the Multiindex. Without doing so, the Dataframe output is as follows:
Answer,Answer,Answer,Answer
1,2,3,4
7296.0,,,
7260.0,,,
7264.0,7265.0,7268.0,7269.0
4221.0,4322.0,,
Can you help me understand how to make it look like the desired output above in my post? Specifically, I want to remove the row with the answer index numbers (1,2,3,4) and then add back in the missing columns of User, RequisitionID and Question to the left of the answer columns. Apologies if this is elementary stuff...