I have a question that is similar in spirit to this previously asked question. Nonetheless, I can't seem to figure out a suitable solution.
Input: I have CSV data that looks like
id,prescriber_last_name,prescriber_first_name,drug_name,drug_cost
1000000001,Smith,James,AMBIEN,100
1000000002,Garcia,Maria,AMBIEN,200
1000000003,Johnson,James,CHLORPROMAZINE,1000
1000000004,Rodriguez,Maria,CHLORPROMAZINE,2000
1000000005,Smith,David,BENZTROPINE MESYLATE,1500
Output: from this I simply need to output each drug, the total cost which is summed over all prescriptions and I need to get a count of the unique number of prescribers.
drug_name,num_prescriber,total_cost
AMBIEN,2,300.0
CHLORPROMAZINE,2,3000.0
BENZTROPINE MESYLATE,1,1500.0
I was able to accomplish this pretty easily with Python. However, when I try to run my code with a much larger (1gb) input, my code does not terminate in a reasonable amount of time.
import sys, csv
def duplicate_id(id, id_list):
if id in id_list:
return True
else:
return False
def write_file(d, output):
path = output
# path = './output/top_cost_drug.txt'
with open(path, 'w', newline='') as csvfile:
fieldnames = ['drug_name', 'num_prescriber', 'total_cost']
writer = csv.DictWriter(csvfile, fieldnames=fieldnames)
writer.writeheader()
for key, value in d.items():
print(key, value)
writer.writerow({'drug_name': key, 'num_prescriber': len(value[0]), 'total_cost': sum(value[1])})
def read_file(data):
# TODO: https://codereview.stackexchange.com/questions/88885/efficiently-filter-a-large-100gb-csv-file-v3
drug_info = {}
with open(data) as csvfile:
readCSV = csv.reader(csvfile, delimiter=',')
next(readCSV)
for row in readCSV:
prescriber_id = row[0]
prescribed_drug = row[3]
prescribed_drug_cost = float(row[4])
if prescribed_drug not in drug_info:
drug_info[prescribed_drug] = ([prescriber_id], [prescribed_drug_cost])
else:
if not duplicate_id(prescriber_id, drug_info[prescribed_drug][0]):
drug_info[prescribed_drug][0].append(prescriber_id)
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
else:
drug_info[prescribed_drug][1].append(prescribed_drug_cost)
return(drug_info)
def main():
data = sys.argv[1]
output = sys.argv[2]
drug_info = read_file(data)
write_file(drug_info, output)
if __name__ == "__main__":
main()
I am having trouble figuring out how to refactor this to handle the larger input and was hoping someone could take a look and provide me some suggestions for how to solve this problem.