
I have the following Excel file:

ID     EmpName                   date           cost
1      bob smith              01/01/2019     10
2      Jane Doe               01/04/2019     20
3      steve ray, bob smith   01/03/2017     100

I want to count the occurrences of each person: bob, jane, and steve. But on ID 3 (as well as other rows) the name field lists multiple employees, which isn't ideal. What is my best approach to tally this?

I'm looking for something like this:

employee      count       cost
bob smith     2           110
jane doe      1           20
steve ray     1           100

Second question:

If my data is as follows:

ID     EmpName1      Empname2    date           cost
1      bob smith                 01/01/2019     10
2      Jane Doe                  01/04/2019     20
3      steve ray     bob smith   01/03/2017     100

Could this be tallied in a similar way?

Oscalation

2 Answers


Using get_dummies:

s = df.EmpName.str.get_dummies(', ')
pd.concat([s.sum(), s.mul(df.cost, axis=0).sum()], axis=1)
Out[666]: 
           0    1
Jane Doe   1   20
bob smith  2  110
steve ray  1  100
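
For context, here is a minimal end-to-end sketch of the same get_dummies idea, starting from the Excel file and renaming the result columns to match the desired output; the file name employees.xlsx is a placeholder:

import pandas as pd

df = pd.read_excel('employees.xlsx')   # placeholder file name

# one indicator column per employee, splitting the name field on ', '
dummies = df['EmpName'].str.get_dummies(', ')

# count = how many rows mention each employee; cost = total cost of those rows
summary = pd.concat([dummies.sum(), dummies.mul(df['cost'], axis=0).sum()], axis=1)
summary.columns = ['count', 'cost']
summary.index.name = 'employee'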

Or, using the unnesting helper defined at the end of this answer:

df.EmpName = df.EmpName.str.split(', ')
unnesting(df, ['EmpName']).groupby('EmpName').cost.agg(['sum', 'count'])
Out[669]: 
           sum  count
EmpName              
Jane Doe    20      1
bob smith  110      2
steve ray  100      1
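
As a side note, on pandas 0.25 or newer the built-in DataFrame.explode does the same job as the unnesting helper; a minimal sketch, starting again from the original string EmpName column:

# requires pandas >= 0.25 for DataFrame.explode
tidy = df.assign(EmpName=df['EmpName'].str.split(', ')).explode('EmpName')
tidy.groupby('EmpName')['cost'].agg(['sum', 'count'])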

Update

s = df[['EmpName1', 'Empname2', 'cost']].melt(['cost']).groupby('value').cost.agg(['sum', 'count'])
s.drop('')  # drop the empty-string group coming from the blank Empname2 cells
Out[678]: 
           sum  count
value                
Jane Doe    20      1
bob smith  110      2
steve ray  100      1
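
If the blank Empname2 cells are read in as NaN rather than empty strings, dropna replaces the drop('') step; a minimal sketch of the same melt approach:

melted = (df[['EmpName1', 'Empname2', 'cost']]
          .melt(id_vars='cost', value_name='EmpName')
          .dropna(subset=['EmpName']))   # rows with no second employee fall away
melted.groupby('EmpName')['cost'].agg(['sum', 'count'])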

Or wide_to_long (this assumes both name columns share the EmpName stub, i.e. EmpName1 and EmpName2):

pd.wide_to_long(df, ['EmpName'], i=['ID'], j='number').groupby('EmpName').cost.agg(['sum', 'count'])

# helper used above: explode list-valued columns into one row per element
# (needs pandas as pd and numpy as np imported)
def unnesting(df, explode):
    # repeat each row's index once per element in its list
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx

    return df1.join(df.drop(columns=explode), how='left')
BENY
  • If the data had a second column for the name, say empname1: bob and empname2: jane doe (assuming there is no comma separation of names because I fix all of them), could you show me how to tally that also? – Oscalation May 30 '19 at 02:38
  • The first line would be empname1: bob smith, empname2: NaN, cost 10; the third line empname1: steve ray, empname2: bob smith, cost 100. – Oscalation May 30 '19 at 02:40
  • Reading over the melt, groupby, etc. documentation. A lot to grasp here. Amazing. Thank you! – Oscalation May 30 '19 at 02:55

You will probably want to restructure your data into something that looks more like:

ID     EmpName                   date           cost
1      bob smith              01/01/2019     10
2      Jane Doe               01/04/2019     20
3      steve ray              01/03/2017     100
3      bob smith              01/03/2017     100

From this point you can use a groupby and a sum statement to find what you are looking for. Something along the lines of:

df.groupby(['EmpName'])[['cost']].sum()

Not changing this can lead to nightmares at later stages of your analysis. It is best practice to have one record per row, which avoids errors later on.
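
For example, a minimal sketch of that restructuring, assuming pandas 0.25+ for explode; the file names are placeholders:

import pandas as pd

df = pd.read_excel('employees.xlsx')   # placeholder file name

# one record per employee per row: split the comma-separated names and explode
tidy = (df.assign(EmpName=df['EmpName'].str.split(', '))
          .explode('EmpName')
          .reset_index(drop=True))

tidy.to_excel('employees_tidy.xlsx', index=False)   # placeholder output name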

Thomas Hayes
  • Please do not change the OP's input. – BENY May 30 '19 at 02:30
  • I understand that your answer will give him the solution to the problem he is asking about. However, it is terrible practice to structure data in that manner, and it will surely lead to problems at later stages. Changing this as soon as possible is the best solution. – Thomas Hayes May 30 '19 at 02:32
  • Providing some information on how to restructure the data to your form would make this answer much more useful, since OP is reading from an Excel file – Untitled123 May 30 '19 at 02:38
  • I concede. BENY has a much more complete answer than I do, with steps you can use for restructuring. You would need to split by the delimiter, which is a comma, copy the cost over for each split, then re-group the data and match by the employee ID. I am still recommending to OP that this should be done ASAP. Structuring your data in this manner is an awful idea that will increase your problems by magnitudes if not fixed. – Thomas Hayes May 30 '19 at 02:45
  • If I understand your comment correctly, Thomas, I agree entirely. This data, with its comma-delimited names, basically housing two values in one field, is just asking for trouble. – Oscalation May 30 '19 at 02:48
  • Thank you. Also, the fact that @Oscalation is putting this into a DataFrame shows that he is doing something beyond the limitations of Excel with the data. Looking at this data in Excel might be readable to a human but will be a huge problem for a SQL database or any statistical analysis. I want to stress that nothing more advanced will likely even be attainable without restructuring the data. I recommend that he tackle this problem before any others. – Thomas Hayes May 30 '19 at 03:01