Reading and Processing Large CSVs with Python

Question

I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.

Input: I have CSV data that looks like

id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500

Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.

drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0

I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.

import sys, csv

def duplicate_id(id, id_list):
    if id in id_list:
        return True
    else:
        return False

def write_file(d, output):
    path = output
    # path = './output/top_cost_drug.txt'
    with open(path, 'w', newline='') as csvfile:
        fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
        writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
        writer.writeheader()
        for key, value in d.items():
            print(key, value)
            writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})

def read_file(data):
    # TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
    drug_info = {}
    with open(data) as csvfile:
        readCSV = csv.reader(csvfile, delimiter=',')
        next(readCSV)
        for row in readCSV:
            prescriber_id = row[0]
            prescribed_drug = row[3]
            prescribed_drug_cost = float(row[4])

            if prescribed_drug not in drug_info:
                drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
            else:
                if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
                    drug_info[prescribed_drug][0].append(prescriber_id)
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
                else:
                    drug_info[prescribed_drug][1].append(prescribed_drug_cost)
    return(drug_info)

def main():
    data = sys.argv[1]
    output = sys.argv[2]
    drug_info = read_file(data)
    write_file(drug_info, output)

if __name__ == "__main__":
    main()

I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone could take a look and provide me some suggestions for how to solve this problem.

I would suggest to profile your code, but just looking at the code I suspect the most of the time the code will be in `duplicate_id()`. But anyway, profile the code https://stackoverflow.com/questions/582336/how-can-you-profile-a-script — Andrej Kesely, Jul 17 '18 at 05:50

score 1 · Accepted Answer · answered Jul 17 '18 at 07:53

IF you can use pandas, Please try the following. Pandas reads your file and store them in dataframe. It is much faster than our manual file processing using iterator.

import pandas as pd


df = pd.read_csv('sample_data.txt')

columns = ['id','drug_name','drug_cost']


df1 = df[columns]
gd = df1.groupby('drug_name')
cnt= gd.count()
s=gd.sum()

out = s.join(cnt,lsuffix='x')
out['total_cost']=out['drug_costx']
out['num_prescriber']=out['drug_cost']
fout = out[['num_prescriber','total_cost']]

fout.to_csv('out_data.csv')

I am getting the following output.

drug_name,num_prescriber,total_cost
AMBIEN,2,300
BENZTROPINE MESYLATE,1,1500
CHLORPROMAZINE,2,3000

Hope this helps.

score 0 · Answer 2 · answered Jul 17 '18 at 06:08

Lists are inefficient at testing membership especially when the length is in the thousands as it costs O(n). Use sets instead to store your prescriber ids, which will reduce the cost of membership tests to O(1).

        if prescribed_drug not in drug_info:
            drug_info[prescribed_drug] = ({prescriber_id}, [prescribed_drug_cost])
        else:
            if prescriber_id not in drug_info[prescribed_drug][0]:
                drug_info[prescribed_drug][0].add(prescriber_id)
                drug_info[prescribed_drug][1].append(prescribed_drug_cost)
            else:
                drug_info[prescribed_drug][1].append(prescribed_drug_cost)

Reading and Processing Large CSVs with Python

2 Answers2