Pandas very slow query

Question

I have the following code which reads a csv file and then analyzes it. One patient has more than one illness and I need to find how many times an illness is seen on all patients. But the query given here

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

is so slow that it takes more than 15 mins. Is there a way to make the query faster?

raw_data = pd.read_csv(r'C:\Users\omer.kurular\Desktop\Data_Entry_2017.csv')

data = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

illnesses = pd.DataFrame({"Finding_Label":[], 
                     "Count_of_Patientes_Having":[],
                         "Count_of_Times_Being_Shown_In_An_Image":[]}) 

ids = raw_data["Patient ID"].drop_duplicates()

index = 0

for ctr in data[:1]:
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = raw_data[raw_data["Finding Labels"].str.contains(ctr)].size / 12
    for i in ids:
        illnesses.at[index, "Count_of_Patientes_Having"] = raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size
    index = index + 1

Part of dataframes:

Raw_data

Finding Labels - Patient ID

IllnessA|IllnessB - 1

Illness A - 2

It would be nice to see [sample data](https://stackoverflow.com/q/20109391/1422451) as we do not have access to your users folder. — Parfait, May 04 '19 at 19:16

score 0 · Answer 1 · answered May 04 '19 at 18:02

From what I read I understand that ctr stands for the name of a disease.

When you are doing this query:

raw_data[(raw_data['Finding Labels'].str.contains(ctr)) & (raw_data['Patient ID'] == i)].size

You are not only filtering the rows which have the disease, but also which have a specific patient id. If you have a lot of patients, you will need to do this query a lot of times. A simpler way to do it would be to not filter on the patient id and then take the count of all the rows which have the disease. This would be:

raw_data[raw_data['Finding Labels'].str.contains(ctr)].size

And in this case since you want the number of rows, len is what you are looking for instead of size (size will be the number of cells in the dataframe).

Finally another source of error in your current code was the fact that you were not keeping the count for every patient id. You needed to increment illnesses.at[index, "Count_of_Patientes_Having"] not set it to a new value each time.

The code would be something like (for the last few lines), assuming you want to keep the disease name and the index separate:

for index, ctr in enumerate(data[:1]):
    illnesses.at[index, "Finding_Label"] = ctr
    illnesses.at[index, "Count_of_Times_Being_Shown_In_An_Image"] = len(raw_data[raw_data["Finding Labels"].str.contains(ctr)]) / 12
    illnesses.at[index, "Count_of_Patientes_Having"] = len(raw_data[raw_data['Finding Labels'].str.contains(ctr)])

I took the liberty of using enumerate for a more pythonic way of handling indexes. I also don't really know what "Count_of_Times_Being_Shown_In_An_Image" is, but I assumed you had had the same confusion between size and len.

Parfait · Answer 2 · 2019-05-04T21:42:50.803

Likely the reason your code is slow is that you are growing a data frame row-by-row inside a loop which can involve multiple in-memory copying. Usually this is reminiscent of general purpose Python and not Pandas programming which ideally handles data in blockwise, vectorized processing.

Consider a cross join of your data (assuming a reasonable data size) to the list of illnesses to line up Finding Labels to each illness in same row to be filtered if longer string contains shorter item. Then, run a couple of groupby() to return the count and distinct count by patient.

# CROSS JOIN LIST WITH MAIN DATA FRAME (ALL ROWS MATCHED)
raw_data = (raw_data.assign(key=1)
                    .merge(pd.DataFrame({'ills':ills, 'key':1}), on='key')
                    .drop(columns=['key'])
            )

# SUBSET BY ILLNESS CONTAINED IN LONGER STRING
raw_data = raw_data[raw_data.apply(lambda x: x['ills'] in x['Finding Labels'], axis=1)]

# CALCULATE GROUP BY count AND distinct count
def count_distinct(grp):
    return (grp.groupby('Patient ID').size()).size

illnesses = pd.DataFrame({'Count_of_Times_Being_Shown_In_An_Image': raw_data.groupby('ills').size(),
                          'Count_of_Patients_Having': raw_data.groupby('ills').apply(count_distinct)})

To demonstrate, consider below with random, seeded input data and output.

Input Data (attempting to mirror original data)

import numpy as np
import pandas as pd

alpha = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789'
data_tools = ['sas', 'stata', 'spss', 'python', 'r', 'julia']

ills = ["Cardiomegaly", "Emphysema", "Effusion", "No Finding", "Hernia", 
        "Infiltration", "Mass", "Nodule", "Atelectasis", "Pneumothorax", 
        "Pleural_Thickening", "Pneumonia", "Fibrosis", "Edema", "Consolidation"]

np.random.seed(542019)
raw_data = pd.DataFrame({'Patient ID': np.random.choice(data_tools, 25),
                         'Finding Labels': np.core.defchararray.add(
                                              np.core.defchararray.add(np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]),
                                                                       np.random.choice(ills, 25).astype('str')),
                                              np.array([''.join(np.random.choice(list(alpha), 3)) for _ in range(25)]))
                         })

print(raw_data.head(10))    
#   Patient ID       Finding Labels
# 0          r   xPNPneumothoraxXYm
# 1     python   ScSInfiltration9Ud
# 2      stata   tJhInfiltrationJtG
# 3          r      thLPneumoniaWdr
# 4      stata    thYAtelectasis6iW
# 5        sas      2WLPneumonia1if
# 6      julia  OPEConsolidationKq0
# 7        sas   UFFCardiomegaly7wZ
# 8      stata         9NQHerniaMl4
# 9     python         NB8HerniapWK

Output (after running above process)

print(illnesses)
#                     Count_of_Times_Being_Shown_In_An_Image  Count_of_Patients_Having
# ills                                                                                
# Atelectasis                                              3                         1
# Cardiomegaly                                             2                         1
# Consolidation                                            1                         1
# Effusion                                                 1                         1
# Emphysema                                                1                         1
# Fibrosis                                                 2                         2
# Hernia                                                   4                         3
# Infiltration                                             2                         2
# Mass                                                     1                         1
# Nodule                                                   2                         2
# Pleural_Thickening                                       1                         1
# Pneumonia                                                3                         3
# Pneumothorax                                             2                         2

Pandas very slow query

2 Answers2