1

I have a CSV file that contains users and their questions and answers for prescreening questions on a job requisition. There are cases where a given question can have multiple answers. Below is how the CSV currently looks:

User,RequisitionID,Question,Answer
user1,190004116,6162,7296
user2,190004086,6115,7260
user2,190004086,6117,7264
user2,190004086,6117,7265
user2,190004086,6117,7268
user2,190004086,6117,7269
user3,190005321,6321,4221
user3,190005321,6321,4322

Instead of the above format, I would like to have Answer data represented in separate columns (with as many columns as there are distinct answers per Question) grouped by User, RequisitionID and Question, like so:

User,RequisitionID,Question,Answer1,Answer2,Answer3,Answer4
user1,190004116,6162,7296,,,
user2,190004086,6115,7260,,,
user2,190004086,6117,7264,7265,7268,7269
user3,190005321,6321,4221,4321,

I've tried using "groupby" as below, but I'm just not able to get the resulting Dataframe in the shape I want...

reqPrscrAnsFileFiltered = reqPrscrAnsFileFiltered.groupby(['User','RequisitionID','Question']).Answer.apply(list)
reqPrscrAnsFileFiltered = pandas.DataFrame(reqPrscrAnsFileFiltered.tolist(), index=reqPrscrAnsFileFiltered.index)

Sorry, but I'm new to Python and Pandas, so any help would be great.

Well - we're getting there. I implemented the pivot_table code snippet you provided in your comment but I'm afraid I'm not sure what you mean about collapsing the Multiindex. Without doing so, the Dataframe output is as follows:

Answer,Answer,Answer,Answer
1,2,3,4
7296.0,,,
7260.0,,,
7264.0,7265.0,7268.0,7269.0
4221.0,4322.0,,  

Can you help me understand how to make it look like the desired output above in my post? Specifically, I want to remove the row with the answer index numbers (1,2,3,4) and then add back in the missing columns of User, RequisitionID and Question to the left of the answer columns. Apologies if this is elementary stuff...

Sandman
  • 17
  • 5
  • This is a `pivot_table` problem, after defining a new column with `groupby` + `cumcount`, which allows you to pivot properly. – ALollz Aug 28 '19 at 16:22
  • Thanks, ALollz. I tried your suggestion of using pivot_table with the new column from groupby + cumcount, but I'm still struggling to get the right shape of the resulting Dataframe. As I said, I'm new to Python and Pandas so this is a bit out of my knowledge area... – Sandman Aug 28 '19 at 19:24
  • First, create the index you need with `df['idx'] = df.groupby(['User', 'Question']).cumcount()+1`. Next you pivot: `df = df.pivot_table(index=['User', 'Question', 'RequisitionID'], columns=['idx'], values=['Answer'])` From there you should be able to use those two duplicacates to collapse the MultiIndex on the columns, and then you can `.reset_index()` – ALollz Aug 28 '19 at 19:28
  • @ALollz - I updated my post with an update and and a couple additional questions. I appreciate your help as I'm sure this is all rudimentary stuff for you (just not for me). – Sandman Aug 28 '19 at 20:03
  • With a MultiIndex, you have a tuple (so two values here). You can combine this into a single string with `df.columns = [''.join(map(str,x)) for x in df.columns]`. Then do `df = df.reset_index()` and that will move everything from the index (user, question, requisitionID) back to normal columns columns. Those two steps should complete everything. – ALollz Aug 28 '19 at 20:12
  • Wow - you really know your stuff! That provided exactly what I needed! I'll definitely keep what I've learned here from you well in mind next time I encounter something similar. Many thanks. Just wish I could upvote your comments... – Sandman Aug 28 '19 at 20:30
  • Glad it helped! – ALollz Aug 28 '19 at 20:31

0 Answers0