5

I have a sorted and nested list. Each element in the list has 3 sub-elements; 'Drugname', Doctor_id, Amount. For a given drugname (which repeats) the doctor ids are different and so are the amounts. See sample list below..

I need an output where, for each drugname, I need to count the total UNIQUE doctor ids and the sum of the dollar amount for that drug. For ex, for the list snippet below..

[
   ['CIPROFLOXACIN HCL', 1801093968, 61.49],
   ['CIPROFLOXACIN HCL', 1588763981, 445.23],
   ['HYDROCODONE-ACETAMINOPHEN', 1801093968, 251.52],
   ['HYDROCODONE-ACETAMINOPHEN', 1588763981, 263.16],
   ['HYDROXYZINE HCL', 1952310666, 945.5],
   ['IBUPROFEN', 1801093968, 67.06],
   ['INVEGA SUSTENNA', 1952310666, 75345.68]
]

The desired output is as below.

[
   ['CIPROFLOXACIN HCL', 2, 516.72],
   ['HYDROCODONE-ACETAMINOPHEN', 2, 514.68]
   ['HYDROXYZINE HCL', 1, 945.5]
   ['IBUPROFEN', 1, 67.06]
   ['INVEGA SUSTENNA', 1, 75345.68]
]

In a database world this is the easiest thing with a simple GROUP BY on drugname. In Python, I am not allowed to use PANDAS, NumPy etc. Just the basic building blocks of Python. I tried the below code but I am unable to reset the count variable to count doctor ids and amounts. This commented code is one of several attempts. Not sure if I need to use a nested for loop or a for loop-while loop combo.

All help is appreciated!

aggr_list = []
temp_drug_name = ''
doc_count = 0
amount = 0
for list_element in sorted_new_list:
    temp_drug_name = list_element[0]
    if temp_drug_name == list_element[0]:
        amount += float(amount)
        doc_count += 1

    aggr_list.append([temp_drug_name, doc_count, amount])

print(aggr_list)

4 Answers4

3

Since the list is already sorted you can simply iterate through the list (named l in the example below) and keep track of the name of the last iteration, and if the name of the current iteration is different from the last, insert a new entry to the output. Use a set to keep track of the doctor IDs already seen for the current drug, and only increment the the second item of the last entry of the output by 1 if the doctor ID is not seen. And increment the third item of the last entry of the output by the amount of the current iteration:

output = []
last = None
for name, id, amount in l:
    if name != last:
        output.append([name, 0, 0])
        last = name
        ids = set()
    if id not in ids:
        output[-1][1] += 1
        ids.add(id)
    output[-1][2] += amount

output becomes:

[['CIPROFLOXACIN HCL', 2, 506.72],
 ['HYDROCODONE-ACETAMINOPHEN', 2, 514.6800000000001],
 ['HYDROXYZINE HCL', 1, 945.5],
 ['IBUPROFEN', 1, 67.06],
 ['INVEGA SUSTENNA', 1, 75345.68]]

Note that decimal floating points are approximated in the binary system that the computer uses (please read Is floating point math broken?), so some minor errors are inevitable as seen in the sum of the second entry above.

blhsing
  • 91,368
  • 6
  • 71
  • 106
  • 1
    Whoa! I was not aware of the 'last' variable element of a list. Thank you! – Chetan Karkhanis Mar 20 '19 at 23:26
  • Actually, after adding the set code, the results are not as expected. It shows ['CIPROFLOXACIN HCL', 1, 506] and ['HYDROCODONE-ACETAMINOPHEN', 1, 515] for the two drugs that should have a doctor_id count of 2. In fact, for all the drugs, the doctor_id count is always showing 1. The previous iteration was giving the right results. – Chetan Karkhanis Mar 21 '19 at 01:15
  • output = [] last = None for name, id, amount in sorted_new_list: if name != last: output.append([name, 0, 0]) last = name ids = set() if id not in ids: ids.add(id) output[-1][1] += 1 output[-1][2] += amount Made a little tweak to the above code. Now it works. – Chetan Karkhanis Mar 21 '19 at 01:43
  • I don't see how my code wouldn't work. Please see a demo here: https://ideone.com/htRNXk Also, I don't see how your tweaked code is any different from mine. You simply swapped the execution of `output[-1][1] += 1` and `ids.add(id)`, in which the order does not matter. – blhsing Mar 21 '19 at 01:58
  • Hi...I looked at the link you provided and for ['CIPROFLOXACIN HCL', 1801093968, 61.49], ['CIPROFLOXACIN HCL', 1588763981, 445.23], ['CIPROFLOXACIN HCL', 1588763981, 123], where you have three elements, the doctor_id shows as 2 in your stdout. My original set had two elements only for this drug as seen in my desired output snippet. – Chetan Karkhanis Mar 21 '19 at 02:43
  • I added that extra entry just to show you that it only counts unique doctor IDs, which for `CIPROFLOXACIN HCL` is correctly 2. Feel free to remove that entry and re-run it. – blhsing Mar 21 '19 at 02:45
  • 1
    Thanks again. I think I got the solution now. Appreciate it! – Chetan Karkhanis Mar 21 '19 at 02:48
0

Here is a solution with a focus on readability, it doesn't consider that the entries in your original list are sorted by drug name.

It does one pass on all the entries of your data , then a pass on the number of unique drugs.

To do only a single pass on all the entries of your sorted data, see @blhsing solution

from collections import defaultdict, namedtuple
Entry = namedtuple('Entry',['doctors', 'prices'])
processed_data = defaultdict(lambda: Entry(doctors=set(), prices=[]))

for entry in data:
    drug_name, doctor_id, price = entry
    processed_data[drug_name].doctors.add(doctor_id)
    processed_data[drug_name].prices.append(price)

stat_list = [[drug_name, len(entry.doctors), sum(entry.prices)] for drug_name, entry in processed_data.items()]
abcdaire
  • 1,528
  • 1
  • 10
  • 21
0

Without Pandas or defaultdict:

    d={}

    for row in l:

       if row[0] in d:
           d[row[0]].append(row[1])
           d[row[0]].append(row[2])
       else:
           d[row[0]]=[row[1]]
           d[row[0]].append(row[2])

    return [[key, len(set(val[0::2])), sum(val[1::2])] for key, val in d.items()]
dejanualex
  • 3,872
  • 6
  • 22
  • 37
0

Reusable solution, meant for those who arrive here trough Google:

def group_by(rows, key):
    m = {}
    for row in rows:
        k = key(row)
        try:
            m[k].append(row)
        except KeyError:
            m[k] = [row]
    return m.values()


grouped_by_drug = group_by(data, key=lambda row: row[0])
result = [
    (
        drug_rows[0][0],
        len(drug_rows),
        sum(row[2] for row in drug_rows)
    )
    for drug_rows in grouped_by_drug
]

You can also use defaultdict in this implementation, which for my use case is slightly faster.

pj.dewitte
  • 472
  • 3
  • 10