Filtering out data

Question

I am trying to filter out values from a pandas data-frame and then generate a column with those values. To further clarify myself here is an example

print (temp.head())

Index  Work-Assigned  Location
  A         R            NL
  B         df           MB
  A                      NL
  C         SL           NL
  D         RC           MB
  A         RC           AB

Now what I want to do is to filter out all the R and SL values from this data-frame and create another data-frame with just those values and the index. Something like this:

print (result.head())

Index    R/SL
  A       R
  B      
  C       SL
  D

I tried pivoting the data with Work-Assigned as the value, as you see certain value in the index column is repeated, but that didn't work.

so something like `df[df['Work-Assigned'].isin(['R', 'SL'])][['Index', 'Work-Assigned']]`? Though it's not clear to me how the rows 'B' and 'D' ended up in your output... — pault, Mar 15 '18 at 18:40
For an index that does not has either 'R' or 'SL' as values, I want them to have empty spaces or NAN value. To further clarify for an Index that has multiple work values I only want the 'R" and 'SL' values, removing all other values and for an Index that only has a NAN value for work, I want to keep that as it is. — , Mar 16 '18 at 16:32

jeschwar · Answer 1 · 2018-03-19T16:19:43.067

I believe the following gives the requested output:

# your original dataframe
df = pd.DataFrame({'Index': {0: 'A', 1: 'B', 2: 'A', 3: 'C', 4: 'D', 5: 'A'}, 'Location': {0: 'NL', 1: 'MB', 2: 'NL', 3: 'NL', 4: 'MB', 5: 'AB'}, 'Work-Assigned': {0: 'R', 1: 'df', 2: '', 3: 'SL', 4: 'RC', 5: 'RC'}}).set_index('Index').reindex(['Work-Assigned', 'Location'], axis=1)


df
Out[5]: 
      Work-Assigned Location
Index                       
A                 R       NL
B                df       MB
A                         NL
C                SL       NL
D                RC       MB
A                RC       AB

def some_filtering(df_, filter_values=['R', 'SL']):
    # use regex to create a Series which contains bool of whether any `filter_values` are found
    s_filter = df_['Work-Assigned'].str.extract('^({})$'.format('|'.join(filter_values)), expand=False)

    # if nothing was found then return a blank string; otherwise return the unique value found
    if s_filter.dropna().empty:
        val = ['']
    else:
        val = pd.unique(s_filter.dropna())

    # return a DataFrame containing the unique value found (could be blank) at the present index value passed to .groupby
    return pd.DataFrame(data=val, index=pd.unique(df_.index), columns=['/'.join(filter_values)])


df.groupby(level='Index', group_keys=False).apply(some_filtering)
Out[7]: 
  R/SL
A    R
B     
C   SL
D

I tried what you suggested but the result data frame is similar to the original one there weren't any changes made. — , Mar 16 '18 at 16:28
This doesn't produce the desired output. See the comments under the question where OP clarifies. — pault, Mar 17 '18 at 04:34
revised answer to show blanks when none of `['R', 'SL']` are found — jeschwar, Mar 19 '18 at 16:21

score 0 · Answer 2 · answered Mar 16 '18 at 17:51

IIUC, you want to group by Index and collect the values into a set. Then check the set for the values 'R' or 'SL'.

Assuming your DataFrame is named df, you could do the following:

Group by 'Index' and apply the set constructor to the 'Work-Assigned' column. This will condense all distinct values for each Index into one row.

df2 = pd.DataFrame(df.groupby('Index')['Work-Assigned'].apply(set)).reset_index()
print(df2)
#  Index Work-Assigned
#0     A  {nan, R, RC}
#1     B          {df}
#2     C          {SL}
#3     D          {RC}

Next check for the intersection of each row's set with the values you want to search for. If the intersection is null, return an empty string (or np.nan if you prefer). Otherwise, pick the first value.¹

my_values = {'R', 'SL'}
df2['Work-Assigned'] = df2['Work-Assigned'].apply(
    lambda x: '' if not my_values.intersection(x) else list(my_values.intersection(x))[0]
)
print(df2)
#  Index Work-Assigned
#0     A             R
#1     B              
#2     C            SL
#3     D

References

grouping rows in list in pandas groupby

Notes

¹ In the case where multiple (in your case both) values exist, you will get one arbitrarily. If that is a problem, please update your problem statement on how you would like to handle that case.

Did this answer your question? If so, please consider accepting it by clicking the check mark. You are under no obligation to do so, but it can be helpful for readers in the future who may come across a similar issue. Read more on [what to do when someone answers my question](https://stackoverflow.com/help/someone-answers). — pault, Mar 20 '18 at 16:32

Filtering out data

2 Answers2