0

This is the ultimate Pandas challenge, in my mind, though it may be elementary to some of you out there...

I am trying to link a particular job position with the survey items to which it corresponds. For example, the president of site A would be attributed to results from a survey item for which respondents from site A provide feedback (i.e "To what degree do you agree with the following statement?: "I think the quality of site A is sufficient overall"").

Each site has 5 stations (0 through 4). Each job position is assigned to one or more station(s) at one or more site(s).

For example, the president of a site works at all stations within that site, while a contractor might only work at a couple of stations, maybe at 2 different sites.

Survey data was collected on the quality of each station within each site.

Some survey items pertain to certain stations within one or more sites.

For example, the "Positions" table looks like this:

import pandas as pd
import numpy as np

pos = pd.DataFrame({'Station(s)':[',1,2,,','0,1,2,3,4'],
                    'Position':['Contractor','President'],
                    'Site(s)':['A,B','A'],
                    'Item(s)':['1','1,2']
                   })

pos[['Position','Site(s)','Station(s)','Item(s)']]

    Position    Site(s)     Station(s)   Item(s)
0   Contractor  A,B          ,1,2,,         1
1   President   A           0,1,2,3,4       1,2

The survey data table looks like this:

sd = pd.DataFrame({'Site(s)':['A','B','B','C','A','A'],
                   'Station(s)':[',1,2,,',',1,2,,',',,,,',',1,2,,','0,1,2,,',',,2,,'],
                   'Item 1':[1,1,0,0,1,np.nan],
                   'Item 2':[1,0,0,1,0,1]})
sd[['Site','Station(s)','Item 1','Item 2']]

    Site    Station(s)  Item 1  Item 2
0      A       ,1,2,,          1    1
1      B       ,1,2,,          1    0
2      B       ,,,,            0    0
3      C       ,1,2,,          0    1
4      A       0,1,2,,         1    0
5      A       ,,2,,           NaN  1

2 side notes:

  1. The item data has been coded to 1 and 0 for unimportant reasons.

  2. The comma separated responses are actually condensed from columns (one column per station and item). I only mention that because if it is better to not condense them, that can be done (or not).

So here's what I need:

This (I think):

    Contractor  President Site(s)  Station(s)   Item 1  Item 2
0      1           1        A       ,1,2,,        1       1
1      1           0        B       ,1,2,,        1       0
2      0           0        B       ,,,,          0       0
3      0           0        C       ,1,2,,        0       1
4      0           1        A       0,1,2,,       1       0
5      1           1        A       ,,2,,         NaN     1

Logic:

The contractor works at site A and B and should only be associated with respondents who work either of those sites. Within those respondents, he should only associated with those who work at stations 1 or 2 but none that also work at any others (i.e. station 0).

Therefore, the contractor's rows of interest in df2 are indices 0, 1, and 5. The president's rows of interest are from indices 0, 4, and 5.

...and, ultimately, this:

    Position    Overall%
0   Contractor  100
1   President   80

Logic:

Because the president is concerned with items 1 and 2, there are 5 numbers to consider: (1 and 1) from item 1 and (1, 0, and 1) from item 2. The sum across items is 4 and the count across items is 5 (again, do not count 'NaN'), which gives 80%.

Because the contractor is only concerned with item 1, there are 2 numbers to consider: 1 and 1 - 'NaN' should not be counted - (from the rows of interest, respectively). Therefore, the sum is 2 out of the count, which is 2, which gives 100%

Thanks in advance!

Update

I know this works (top answer just under question), but how can that be applied to this situation? I tried this (just to try the first part of the logic):

for i in pos['Position']:
    sd[i]=[x for x in pos.loc['Site(s)'] if x in sd['Site']]

...but it threw this error:

KeyError: 'the label [Site(s)] is not in the [index]'

...so I'm still wrestling with it.

Community
  • 1
  • 1
Dance Party
  • 3,459
  • 10
  • 42
  • 67

1 Answers1

1

If I understand correctly, you want to add one column to sd for each job position in pos. (This is the first task.)

So, for each row index i in pos (we iterate over rows in pos), we can create a unique boolean column:

# PSEUDOCODE:
sd[position_name_i] = (sd['Site'] IS_CONTAINED_IN pos.loc[i,'Site(s)']) and (sd['Station(s)'] IS_CONTAINED_IN pos.loc[i,'Station(s)'])

I hope the logic here is clear and consistent with your goal.

The expression X IS_CONTAINED_IN Y may be implemented in many different ways. (I can think of X and Y being sets and then it's X.subset(Y). Or in terms of bitmasks X, Y and bitwise_xor.)

The name of the column, position_name_i may be simply an integer i. (Or something more meaningful as pos.loc[i,'Position'] if this column consists of unique values.)

If this is done, we can do the other task. Now, df[df[position_name_i]] will return only the rows of df for which position_name_i is True.

We iterate over all positions (i.e. rows in pos). For each position:

# number of non-nan entries in 'Item 1' and 'Item 2' relevant for the position:
total = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].count().sum()
# number of 1's among these entries: 
partial = df.loc[df['position_name_i'], ['Item 1', 'Item 2']].sum().sum()

The final Overall% for the given position is 100*partial/total.

ptrj
  • 5,152
  • 18
  • 31
  • Your logic seems sound. For stations, both of these conditions must be met: 1. At least one element in survey (sd['Station(s)']) must exist in positions (pos['Station(s)']). 2. No element in survey (sd['Station(s)']) must not be found in the positions (pos['Station(s)']. – Dance Party May 15 '16 at 19:32
  • For sites, the site from sd['Site(s)'] must exist in those from pos['Site(s)']. – Dance Party May 15 '16 at 19:37
  • I'm not sure if I understand condition 2 for stations. Condition 1 is "intersection of X and Y is not empty", I guess. Anyway, whatever the conditions, they can always be written in the language of set theory (i.e. as logical sentences about set unions, intersections, inclusions etc.) If you think my answer is not quite correct, feel free to edit it. And please ask if something is unclear. – ptrj May 15 '16 at 20:16
  • I think this might be a TL;DR type of post. I'm going to post smaller questions that will lead up to an answer, but I'll leave this up in the meantime. I'll link to each individual post as I post them. – Dance Party May 16 '16 at 00:31
  • First question: http://stackoverflow.com/questions/37245220/pandas-create-columns-from-rows-in-other-data-frame-with-criteria – Dance Party May 16 '16 at 00:44
  • Second question: http://stackoverflow.com/questions/37245338/pandas-pivot-table-with-fancy-stacking – Dance Party May 16 '16 at 01:05