1

I am a doctor looking at surgical activity in a DataFrame that has 450,000 records and 129 fields or columns. Each row describes a patient admission to hospital, and the columns describe the operation codes and diagnosis codes for the patient.

df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 452883 entries, 0 to 452882
Columns: 129 entries, fyear to opertn_24Type
dtypes: float64(5), int64(14), object(110)
memory usage: 445.7+ MB

There are 24 operation columns for each row. I want to search in the operation columns (1-24) for the codes for pituitary surgery "B041" and "B012", to identify all patients who have had surgery for a pituitary tumor.

I am a total python beginner and have tried using iloc to describe the range of columns (1-24) which appear starting at position 72 in the list of columns but couldn't get it to work. I am quite happy searching for individual values eg "B041" in a single column using df["OPERTN_01"] == "B041" but would ideally like to search multiple columns (all the surgical columns 1-24) more efficiently. I tried searching the whole dataframe using

test = df[df.isin(["B041", "B012"])]

but that just returns the entire dataframe with null values.

So I have a few questions.

  1. How do I identify integer positions (iloc numbers) for columns in a large dataframe of 129 columns? I just listed them and counted them to get the first surgical column ("OPERTN_01") at position 72 — there must be an easier way.
  2. What's the best way to slice a dataframe to select records with multiple values from multiple columns?
SecretAgentMan
  • 2,856
  • 7
  • 21
  • 41
capnahab
  • 343
  • 3
  • 14
  • 2
    Would be nice to a sample input and output in the text form? – Grayrigel Sep 18 '20 at 13:13
  • Sorry not sure what you mean, its difficult to display the 129 columns in this software. – capnahab Sep 18 '20 at 13:17
  • you don't have to show the whole dataframe. Only the sample that can be used to reproduce the problem. Please see this : https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Grayrigel Sep 18 '20 at 13:22
  • lets say in one of the 29 columns returns a true value, do you want to return the entire dataframe where that row is true? if there are 5 rows and only row 5 in columns 1-27 contain your value then return that row? – Umar.H Sep 18 '20 at 13:30

2 Answers2

2

Let's use .iloc and create a boolean for you to filter by:

import pandas as pd
import numpy as np
np.random.seed(12)
df = pd.DataFrame({"A" : ["John","Deep","Julia","Kate","Sandy"], 
                     "result_1" : np.random.randint(5,15,5),
                    "result_2" : np.random.randint(5,15,5) })
print(df)

       A  result_1  result_2
0   John        11         5
1   Deep         6        11
2  Julia         7         6
3   Kate         8         9
4  Sandy         8        10

Next we need to find your intended values in the selected columns:

df.iloc[:,1:27].isin([11,10])

this returns:

   result_1  result_2
0      True     False
1     False      True
2     False     False
3     False     False
4     False      True

From the above, we need to slice our original dataframe by the rows where any value is true (if I've understood you correctly).

For this we can use np.where() with .loc:

df.loc[np.where(df.iloc[:,1:].isin([11,10])==True)[0]]

       A  result_1  result_2
0   John        11         5
1   Deep         6        11
4  Sandy         8        10

From here it's a simple task to extract your unique IDs.

halfer
  • 19,824
  • 17
  • 99
  • 186
Umar.H
  • 22,559
  • 7
  • 39
  • 74
1

Answer 1:

Let's say you are looking for eg_col in your dataframe's columns. Then, you can find its index within the columns using:

df.columns.tolist().index('eg_col')

Answer 2:

In your example, if you know the name of the last surgical column (let's say it's called OPERTN_24, you can slice those columns using:

df_op = df.loc[:, 'OPERTN_01':'OPERTN_24']

Continuing from that, we can look for values of 'B041', 'B012' in df_surg as you tried: df_op.isin['B041', 'B012'] which will return the boolean value for all dataframe entries.

To extract, for example, only those rows where at least one of our 'B041' values comes up, we select those rows with:

df.index[df_surg.isin(['B041', 'B012']).any(axis=1)]
aybry
  • 316
  • 1
  • 7
  • answer 1 - good stuff but how about all 24 of the OPERTN columns or a list of all column numbers ? and using ``` df.index[df_surg.isin['B041', 'B012'].any(axis=1)]``` gives ```--------------------------------------------------------------------------- TypeError Traceback (most recent call last) in ----> 1 df.index[df_op.isin['B041', 'B012'].any(axis=1)] TypeError: 'method' object is not subscriptable``` – capnahab Sep 18 '20 at 13:41
  • 1
    Re: error, my mistake, parentheses were missing: `df.index[df_op.isin(['B041', 'B012']).any(axis=1)]`. You can extract the columns containing `'OPERTN'` by using the `.str.match` method (see [here](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.match.html)), and their corresponding `iloc`-type indices with Numpy's `where` method: `np.where(df.columns.str.match('OPERTN') == True)`. – aybry Sep 18 '20 at 14:20
  • Many thanks, Have learnt a lot from that though I still can't get the Jupyter Notebook cell to copy into this forum software despite reading the suggestion by Grayrigel above. – capnahab Sep 18 '20 at 20:34