Filter out rows of a dataframe containing a specific string

Question

I have a massive dataframe. The dataframe has column patient.drug. This column contains list of dictionaries as its elements. I want to filter out all the rows that contain 'NIFEDIPINE' word in patient.drug column.

The dataframe is very large. Here is a sample of it.

                                                         patient.drug
0                       [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1                       [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]      
2                       [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3                       [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

so far, I have tried

df[df['patient.drug'].str.contains('NIFEDIPINE')]

but it is giving me an error.

 raise KeyError(f"None of [{key}] are in the [{axis_name}]")

KeyError: "None of [Float64Index([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan,\n              ...\n              nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],\n             dtype='float64', length=12000)] are in the [columns]"

I have also tried using in operator and iterating over rows.

lst=[]
for i in range(len(df)):
    if 'NIFEDIPINE' in df.loc[i, "patirnt.drug"]:
        lst.append(i)
print(lst)

Which is also causing an error. What should I do to get it right?

Does this answer your question? [pandas dataframe str.contains() AND operation](https://stackoverflow.com/questions/37011734/pandas-dataframe-str-contains-and-operation) — Nk03, Jun 25 '21 at 05:46
Seems like you have (a lot of) `NaN`s in `patient.drug` column. So please try `df[df["patient.drug"].str.contains("NIFEDIPINE", na=False)]` which will say `NaN`s are `False` instead of themselves in the return value, which you can then hopefully safely use as a mask. — Mustafa Aydın, Jun 25 '21 at 05:49
Can you include a small sample of your dataframe? It would make your question and desired output much clearer. — rahlf23, Jun 25 '21 at 06:01
What pandas version you are using? can you show us the dataFrame which you are using right now that will help to get answer. Because the data everyone is expecting is the not the way. — Karn Kumar, Jun 25 '21 at 13:53
Please see my answer, thats Should work as i reproduced the same error, this is Because it see `dtype='float64'` hene we need to convert it to string to carry the `contains` operation on it. — Karn Kumar, Jun 25 '21 at 14:29

SeaBean · Answer 1 · 2021-06-25T15:11:45.090

1

Suppose you have this layout of column:

Search string 'NIFEDIPINE' found on the 2nd and 4th entries:

data = {'patient.drug': 
     [[{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}],
      [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}],
      [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}],
      [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}],
     ]
}
df = pd.DataFrame(data)

                                                         patient.drug
0       [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1   [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]      <=== keyword here
2  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3   [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]      <=== keyword here

(Layout digged out from your previous questions)

Solution:

[Updated for 1) support more than one dict in a list and 2) partial string match].

Use: .loc + .explode() + .apply():

keyword = 'NIFEDIPINE'
df.loc[df['patient.drug'].explode().apply(lambda d: keyword in ' '.join(d.values())).any(level=0)]

Result:

Rows with keyword string 'NIFEDIPINE' correctly extracted and displayed:

                                                        patient.drug
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
3  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

edited Jun 25 '21 at 15:11

answered Jun 25 '21 at 08:28

SeaBean

22,547
3
13
25

I tried this. It is giving an error: ```ValueError: cannot reindex from a duplicate axis ``` – user16309118 Jun 25 '21 at 13:48
@user16309118, you have the same dataFrame as SeaBean shown in his answer? – Karn Kumar Jun 25 '21 at 14:01
1

@KarnKumar Yes, I have added a sample in the question. – user16309118 Jun 25 '21 at 14:06
@user16309118 Is there some entries can have more than one dict in the list ? – SeaBean Jun 25 '21 at 14:43
@user16309118 See my edit above. Now support more than one dict in a list. Please retry. – SeaBean Jun 25 '21 at 14:50
@SeaBean Yes, some entries have more than one dictionaries. – user16309118 Jun 25 '21 at 14:50
@user16309118 Please retry with my solution and let me know the result. Anyway you are free to choose your solution. :-) – SeaBean Jun 25 '21 at 14:52
@SeaBean I just tried. It worked. But it is giving me 13 rows, whereas it should give me 19 rows. Anyways, thank you so much! – user16309118 Jun 25 '21 at 14:56
@user16309118 Alright! Probably your data has some special layout/features not shown on the sample data I crafted. Therefore, cannot be detected. Anyway, suggest you to craft better sample data so that people can easier to help you. If I didn't dig out your sample data from your previous questions. probably this question is still pending without solved. – SeaBean Jun 25 '21 at 15:01
Probably there is partial string match required rather than full match. Updated to support partial string match now. – SeaBean Jun 25 '21 at 15:14

Karn Kumar · Accepted Answer · 2021-06-25T14:53:45.710

After reproducing your Data,

>>> df
                                                         patient.drug
0  [{'drugcharacterization': '1', 'medicinalproduct': 'PANDOL'}]
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
2  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
3  [{'drugcharacterization': '3', 'medicinalproduct': 'SIMVASTATIN'}]
4  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

While using Your code:

>>> df[df['patient.drug'].str.contains('NIFEDIPINE')]

Error:

    raise KeyError(f"None of [{key}] are in the [{axis_name}]")
KeyError: "None of [Float64Index([nan, nan, nan, nan, nan], dtype='float64')] are in the [columns]"

Solution:

    >>> df[df['patient.drug'].astype('str').str.contains('NIFEDIPINE')]
                                                        patient.drug
1  [{'drugcharacterization': '2', 'medicinalproduct': 'NIFEDIPINE'}]
4  [{'drugcharacterization': '4', 'medicinalproduct': 'NIFEDIPINE'}]

Note:

This is raising issue due to indexer check in the pandas indexer.py section, which is as follows:

--> pandas/core/indexing.py

# Count missing values:
missing_mask = indexer < 0
missing = (missing_mask).sum()

if missing:
    if missing == len(indexer):
        axis_name = self.obj._get_axis_name(axis)
        raise KeyError(f"None of [{key}] are in the [{axis_name}]")

    # We (temporarily) allow for some missing keys with .loc, except in
    # some cases (e.g. setting) in which "raise_missing" will be False

Sarvesh Dubey · Answer 3 · 2021-06-25T07:08:52.733

It was unclear if your column's each element is list of dictionary or just dictionary. Anyways, I have included solution for both cases.

In case each element of column is dictionary then try below:

import pandas as pd

a = [1, 2, 3, 4, 6]
b = [{'a':'A'}, {'b':'B'}, {'c':'C'}, {'d':'D'}, {'e':'E'}]
df = pd.DataFrame({'A': a, 'B': b})

df[df['B'].apply(lambda x: 'a' in x)]

This gives output as:

A   B
1   {'a': 'A'}

In your case

df[df['B'].apply(lambda x: 'NIFEDIPINE' in x)]

In case each element of column is list of dictionaries then try below:

import pandas as pd

a = [1, 2, 3, 4, 6]
b = [[{'a':'A'}], [{'b':'B'}], [{'c':'C'}], [{'d':'D'}], [{'e':'E'}]]
df = pd.DataFrame({'A': a, 'B': b})

def check(key, dict_list):
  for map in dict_list:
    if key in map:
      return True
  return False

df[df['B'].apply(lambda x: check('a', x))]

score 0 · Answer 4 · answered Jun 25 '21 at 07:11

0

You can use isin

drug_name = ['NIFEDIPINE']

df_NIFEDIPINE = df[df['patient.drug'].isin(drug_name)].reset_index()

answered Jun 25 '21 at 07:11

nkwachiabel

83
5

Filter out rows of a dataframe containing a specific string

4 Answers4

Solution: