0

I'm trying to subset (retrieve a set of rows) a python pandas data frame by using pd.filter with a regex string to identify the columns of interest before performing a subset based on the values in those columns.

For example, this is my mock data frame:

id status status_drug_use drugA drugA_use    drugB  drugB_use
0  1      analgesic       0     None         1      hypertensive
1  0      analgesic       1     analgesic    1      hypertensive
2  0      analgesic       1     hypertensive 0      None
3  1      analgesic       0     None         1      analgesic

I would like all rows that contain the values in columns drugA_use or drugB_use which match the value in status_drug_use. As per the example, this would return the two rows:

id status status_drug_use drugA drugA_use    drugB  drugB_use
1  0      analgesic       1     analgesic    1      hypertensive
3  1      analgesic       0     None         1      analgesic

There are a few column name conventions to stick with:

  1. status_drug_use is always there.
  2. The matching columns (drugA_use and drugB_use) always follow the template <ANYTHING>_use.

Alteration There is a second scenario, one in which I would like to perform a comparison between a user defined string eg analgesic and the two columns drugA_use and drugB_use. This is different from using the content of status_drug_use.

Anthony Nash
  • 834
  • 1
  • 9
  • 26

1 Answers1

1

Here's a way to do what you've asked:

df2 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('status_drug_use == all_use').drop_duplicates().drop(columns='all_use')

Input:

  id status status_drug_use drugA     drugA_use drugB     drugB_use
0  0      1       analgesic     0          None     1  hypertensive
1  1      0       analgesic     1     analgesic     1  hypertensive
2  2      0       analgesic     1  hypertensive     0          None
3  3      1       analgesic     0          None     1     analgesic

Output:

  id status status_drug_use drugA  drugA_use drugB     drugB_use
1  1      0       analgesic     1  analgesic     1  hypertensive
3  3      1       analgesic     0       None     1     analgesic

Explanation:

  • find the subset of all columns ending in _use (excluding status_drug_use)
  • add a column named all_use whose value for a given row is a list of the values in the columns ending in _use
  • use explode() to add rows such that for each original row, there are now multiple rows, one for each of the values in all_use for the original row
  • use query() to select only rows where status_drug_use matches the value in all_use
  • use drop_duplicates to eliminate rows in case there were multiple matches for any rows in the original dataframe (for example, if both drugA_use and drugB_use contained "analgesic" and so did status_drug_use)
  • drop the column all_use as we no longer need it.

UPDATE: Addressing OP's question in a comment: 'Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"?'

You can do this by having the user defined query string (call it user_defined_str) as a variable and changing the contents of query() by replacing the column name status_drug_use with the variable name with @ prepended: @user_defined_str (see the query() docs here for more detail).

user_defined_str = 'analgesic'
df3 = df.assign(all_use=df.apply(
    lambda x: list(x[[col for col in df.columns if col.endswith('_use') and col != 'status_drug_use']]), 
    axis=1)).explode(
    'all_use').query('@user_defined_str == all_use').drop_duplicates().drop(columns='all_use')
constantstranger
  • 9,176
  • 2
  • 5
  • 19
  • Thanks. Certainly does what I want. I've updated the original question as I have a closely related question. Rather than using the values in column status_drug_use, how do I achieve the same output but by using a single user defined string e.g., "analgesic"? The result should be the same, it only differs when None entries make there way in the status_drug_use column. – Anthony Nash May 23 '22 at 21:28
  • See my updated answer to address this. For additional follow-up queries, please ask a new question as that's the usual protocol (feel free to link it in the comments here if you'd like to make sure I see it). – constantstranger May 23 '22 at 22:10