0

I have some data that has come from some monte carlo simulations and basically I want to reduce my initial dataset (quite large) to a smaller one so plotting is easier. The data I want is represented in strings and there can be multiple entries for each simulation, but there is a chance that every string I want may not have been output during that specific simulation. In this case I would like for that to be represented in a column as a '0'.

Basically I would like for the following to happen:

    Run_Number Event
1   1          A
2   1          A
3   1          B
4   1          C
5   2          A
6   2          A
7   2          B
8   3          A

Becomes:

    Run_Number A_cnt B_cnt C_cnt
1   1          2     1     1
2   2          2     1     0
3   3          1     0     0

I've tried grouping the data by the event type and grabbing the size of each event, but if a certain event never happens then I won't get a value for that event and I will end up with incompatible column lengths.

Here is what I am doing, but it is pretty slow

dataframe_collection = {}

#error is one of 3 cases I'm running
for error in Blue['Error_Profile']:
    error_df = Blue.loc[Blue['Error_Profile'] == error]    
    Blue_foo = error_df.groupby(['Run_Number', 'Event_Type'], as_index=False).size() 
    Blue_bar = error_df.groupby(['Run_Number', 'Result'], as_index=False).size()
    nRuns_blue = Blue['Run_Number'].unique()
    
    nEventA_blue = []
    nEventB_blue = []
    nEventC_blue = []
    nEventD_blue = []
    for run in nRuns_blue:
        eventA= Blue_foo[(Blue_foo['Run_Number'] == run) & (Blue_foo['Event_Type'] == 'A')]
        eventB= Blue_foo [(Blue_foo ['Run_Number'] == run) & (Blue_foo ['Event_Type'] == 'B')]
        eventC= Blue_foo [(Blue_foo ['Run_Number'] == run) & (Blue_foo ['Event_Type'] == 'C')]
        eventD= Blue_bar [(Blue_bar ['Run_Number'] == run) & (Blue_bar ['Result'] == 'D')]
        
        nEventA_blue append(0) if eventA.empty else nEventA_blue.append(eventA['size'].item())
        nEventB_blue.append(0) if eventB.empty else nEventB_blue .append(eventB['size'].item())
        nEventC_blue.append(0) if eventC.empty else nEventC_blue.append(eventC['size'].item())
        nEventD_blue.append(0) if eventD.empty else nEventD_blue.append(eventD['size'].item())
    
    data = {'A_cnt': nEventA_blue , 'B_cnt': nEventB_blue , 'C_cnt': nEventC_blue , 'D_cnt': nEventD_blue }
    dataframe_collection[error] = pd.DataFrame(data, index=nRuns_blue)
Wallabee
  • 31
  • 4
  • `pd.crosstab(df['Run_Number'], df['Event'])` (From [this answer](https://stackoverflow.com/a/39132761/15497888)) or `df.groupby(['Run_Number', 'Event']).size().unstack(fill_value=0)` (From [this answer](https://stackoverflow.com/a/39132900/15497888)) – Henry Ecker Sep 03 '21 at 20:29
  • Add `.add_suffix('_cnt').reset_index().rename_axis(columns=None)` to the end of either option above to get exact output. – Henry Ecker Sep 03 '21 at 20:30
  • 1
    @HenryEcker Thank you so much! I knew there had to be an easy way but I just couldn't find what I was looking for. This was a great help. Have a great weekend! – Wallabee Sep 03 '21 at 20:54

0 Answers0