This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...
I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed to results from a survey item for which respondents from site A provide feedback (i.e "To what degree do you agree with the following statement?: "I think the quality of site A is sufficient overall"").
Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).
For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.
Survey data was collected on the quality of each station within each site.
Some survey items pertain to certain stations within one or more sites.
For example, the "Positions" table looks like this:
import pandas as pd
import numpy as np
pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
'Position':['Contractor','President'],
'Site(s)':['A,B','A'],
'Item(s)':['1','1,2']
})
pos[['Position','Site(s)','Station(s)','Item(s)']]
Position Site(s) Station(s) Item(s)
0 Contractor A,B ,1,2,, 1
1 President A 0,1,2,3,4 1,2
The survey data table looks like this:
sd = pd.DataFrame({'Site(s)':['A','B','B','C','A','A'],
'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
'Item 1':[1,1,0,0,1,np.nan],
'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]
Site Station(s) Item 1 Item 2
0 A ,1,2,, 1 1
1 B ,1,2,, 1 0
2 B ,,,, 0 0
3 C ,1,2,, 0 1
4 A 0,1,2,, 1 0
5 A ,,2,, NaN 1
2 side notes:
The item data has been coded to 1 and 0 for unimportant reasons.
The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).
So here's what I need:
This (I think):
Contractor President Site(s) Station(s) Item 1 Item 2
0 1 1 A ,1,2,, 1 1
1 1 0 B ,1,2,, 1 0
2 0 0 B ,,,, 0 0
3 0 0 C ,1,2,, 0 1
4 0 1 A 0,1,2,, 1 0
5 1 1 A ,,2,, NaN 1
Logic:
The contractor works at site A and B and should only be associated with respondents who work either of those sites. Within those respondents, he should only associated with those who work at stations 1 or 2 but none that also work at any others (i.e. station 0).
Therefore, the contractor's rows of interest in df2 are indices 0, 1, and 5. The president's rows of interest are from indices 0, 4, and 5.
...and, ultimately, this:
Position Overall%
0 Contractor 100
1 President 80
Logic:
Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.
Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%
Thanks in advance!
Update
I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):
for i in pos['Position']:
sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]
...but it threw this error:
KeyError: 'the label [Site(s)] is not in the [index]'
...so I'm still wrestling with it.