
I have a list of patient IDs and drug names, and a list of patient IDs and disease names. I want to find the most indicative drug for each disease.

To find this I want to run Fisher's exact test to get a p-value for each disease/drug pair. But the loop runs very slowly (more than 10 hours). Is there a way to make the loop more efficient, or a better way to solve this association problem?
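For reference, each disease/drug pair reduces to a single 2x2 contingency table, and scipy's fisher_exact returns the odds ratio and p-value for it. A minimal sketch with made-up counts:

from scipy.stats import fisher_exact

# hypothetical counts for one disease/drug pair
#            disease=Yes  disease=No
table = [[30,  70],    # patients on the drug
         [10, 190]]    # patients not on the drug
oddsratio, p = fisher_exact(table)
print(p)  # a small p-value suggests the drug and disease are associated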

My loop:

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact 

most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique()) 
disease_list = list(meps_base_data.columns.values)[8:]

for i in disease_list:
    print(i)
    rx_dict = {}
    for j in rx_list:
        # `base` is the merged patient/disease/prescription table
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        # Fisher's exact test needs a full 2x2 table; drugs with
        # no variation in this subset get NaN
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)
  • Would it be possible to post a small sample of the data? https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples – Alexander Aug 21 '17 at 00:47
  • The fact that you are working with `pandas` and `numpy` is more important than the fact that you are working with loops. I changed your tags accordingly. – hpaulj Aug 21 '17 at 00:52
  • Printing to the console every iteration will slow it down much more than you'd think. If you absolutely must see the output, I'd suggest writing to a file. https://stackoverflow.com/questions/13288185/performance-effect-of-using-print-statements-in-python-script – shockawave123 Aug 21 '17 at 00:56
  • Have you done a "back-of-the-napkin" calculation of how many times you're iterating (`len(disease_list) * len(rx_list)`)? Have you used the profiler? (see the sketch after these comments) – thebjorn Aug 21 '17 at 01:04
  • From context it looks like the data may be the MEPS Prescribed Medicines File, https://meps.ahrq.gov/mepsweb/data_stats/download_data_files_results.jsp?cboDataYear=All&cboDataTypeY=2%2CHousehold+Event+File&buttonYearandDataType=Search&cboPufNumber=All&SearchTitle=Prescribed+Medicines – Hugh Bothwell Aug 21 '17 at 01:40
  • Looks like the same data as .csv at https://github.com/Saynah/platform/archive/d7e9f150ef2ff436387585960ca312a301847a46.zip (13.1 MB download; meps_meds.csv and meps_base_data.csv in /data), with an online notebook and examples at https://github.com/Saynah/platform/blob/d7e9f150ef2ff436387585960ca312a301847a46/inspect.ipynb – Hugh Bothwell Aug 21 '17 at 01:57
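Following up on thebjorn's comment, a rough sketch (not from the thread) of the napkin math and of profiling a single disease with the standard-library profiler; run_one_disease is a hypothetical wrapper around the question's inner loop for one disease:

import cProfile

# napkin math: ~9,218 unique drug names * 10 diseases ~= 92,000 Fisher tests,
# each preceded by a drop_duplicates() over a ~1.1M-row table
print(len(disease_list) * len(rx_list))

# run_one_disease is a hypothetical wrapper around the inner loop;
# profiling one disease shows where the time actually goes
cProfile.run('run_one_disease(disease_list[0])', sort='cumtime')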

2 Answers


You need multiprocessing/multithreading; I have added the code:

from multiprocessing.dummy import Pool as ThreadPool
most_indicative_medication = {}
rx_list = list(meps_meds.rxName.unique()) 
disease_list = list(meps_base_data.columns.values)[8:]

def run_pairwise(i):
    print(i)
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', i, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[i].isin(['Yes', 'No'])]
        subset = subset[[i, j]]
        tab = pd.crosstab(subset[i], subset[j])
        if len(tab.columns) == 2:
            rx_dict[j] = fisher_exact(tab)[1]
        else:
            rx_dict[j] = np.nan
    # threads share memory, so writing to the shared dict here is fine
    most_indicative_medication[i] = min(rx_dict, key=rx_dict.get)

pool = ThreadPool(3)  # 3 worker threads
pool.map(run_pairwise, disease_list)
pool.close()
pool.join()

Notes: http://chriskiehl.com/article/parallelism-in-one-line/
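One caveat worth adding (not from the original answer): multiprocessing.dummy gives you threads, which all share Python's GIL, and fisher_exact is CPU-bound, so real processes may scale better. A sketch of a process-based variant, assuming the same base, rx_list, and disease_list defined at module level; note the worker must return its result, because processes don't share memory:

from multiprocessing import Pool

def run_pairwise_proc(disease):
    rx_dict = {}
    for j in rx_list:
        subset = base[['id', disease, 'rxName']].drop_duplicates()
        subset[j] = subset['rxName'] == j
        subset = subset.loc[subset[disease].isin(['Yes', 'No'])]
        tab = pd.crosstab(subset[disease], subset[j])
        rx_dict[j] = fisher_exact(tab)[1] if len(tab.columns) == 2 else np.nan
    # return instead of mutating a global: results come back via the pool
    return disease, min(rx_dict, key=rx_dict.get)

if __name__ == '__main__':
    with Pool(4) as pool:
        most_indicative_medication = dict(pool.map(run_pairwise_proc, disease_list))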

  • This is exactly what I need. Thank you so much! Btw, is there a limit on the thread pool count? I assume the more, the faster? – Gina Sun Aug 21 '17 at 02:29

Crunching faster is good, but a better algorithm usually beats it ;-)

Filling in a bit,

import numpy as np
import pandas as pd
from scipy.stats import fisher_exact

# data files can be download at
# https://github.com/Saynah/platform/tree/d7e9f150ef2ff436387585960ca312a301847a46/data
meps_meds      = pd.read_csv("meps_meds.csv")               #  8 cols * 1,148,347 rows
meps_base_data = pd.read_csv("meps_base_data.csv")          # 18 columns * 61,489 rows

# merge to get disease and drug info in same table
merged = pd.merge(                                          # 25 cols * 1,148,347 rows
    meps_base_data, meps_meds,
    how='inner', left_on='id', right_on='id'
)

rx_list        = meps_meds.rxName.unique().tolist()         # 9218 items
disease_list   = meps_base_data.columns.values[8:].tolist() # 10 items

Note that rx_list has a LOT of duplicates (e.g. 52 entries for Amoxicillin, if you include misspellings).
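If many of those duplicates are just case or whitespace variants, a cheap normalization pass (my addition, not in the original answer) shrinks rx_list before any testing; genuine misspellings would still need a manual mapping:

# hypothetical cleanup: collapse case/whitespace variants of drug names
meps_meds['rxName'] = meps_meds['rxName'].str.strip().str.upper()
rx_list = meps_meds.rxName.unique().tolist()   # noticeably shorter list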

Then

most_indicative = {}

for disease in disease_list:
    # get unique (id, disease, prescription)
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    # keep only Yes/No entries under disease
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    # summarize (replaces the inner loop)
    tab = pd.crosstab(subset.rxName, subset[disease])

    # totals of the "No" and "Yes" columns, used to fill in the
    # complementary cells of each drug's 2x2 table
    nf, yf = tab.sum().values
    def p_value(x, nf=nf, yf=yf):
        # 2x2 table: [[No & not this drug, No & this drug],
        #             [Yes & not this drug, Yes & this drug]]
        return fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1]

    # OPTIONAL:
    # We can probably assume that the most-indicative drugs are among
    #  the most-prescribed; get just the 100 most-prescribed drugs
    # Note: you have to get the nf, yf values before doing this!
    tab = tab.sort_values("Yes", ascending=False)[:100]

    # and apply the function
    tab["P-value"] = tab.apply(p_value, axis=1)

    # find the best match
    best_med = tab.sort_values("P-value").index[0]
    most_indicative[disease] = best_med

This now runs in about 18 minutes on my machine, and you could probably combine it with the multithreading answer above to speed it up by another factor of 4 or more.
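To make that combination concrete, a sketch (mine, not from either answer) that wraps the vectorized per-disease body in a function and maps it over disease_list with a thread pool:

from multiprocessing.dummy import Pool as ThreadPool

def best_drug(disease):
    # same vectorized per-disease body as above
    subset = merged[['id', disease, 'rxName']].drop_duplicates()
    subset = subset[subset[disease].isin(['Yes', 'No'])]
    tab = pd.crosstab(subset.rxName, subset[disease])
    nf, yf = tab.sum().values
    tab = tab.sort_values('Yes', ascending=False)[:100]
    pvals = tab.apply(
        lambda x: fisher_exact([[nf - x.No, x.No], [yf - x.Yes, x.Yes]])[1],
        axis=1)
    return disease, pvals.idxmin()

with ThreadPool(4) as pool:
    most_indicative = dict(pool.map(best_drug, disease_list))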
