4

I'm wanting to take the following style of dataframe into a dictionary.

Input:

>>>import pandas as pd

>>>df = pd.read_csv('file.csv')
>>>print(df)

   Market  Rep  Name  Date  Amount
0  A1      B1   C1    D1    1
1  A1      B1   C1    D1    2 
2  A1      B1   C1    D2    3
3  A1      B1   C1    D2    4
4  A1      B1   C2    D1    5
5  A1      B1   C2    D1    6
6  A1      B1   C2    D2    7
7  A1      B1   C2    D2    8
8  A1      B2   C3    D1    9
9  A1      B2   C3    D1    10
10 A1      B2   C3    D2    11
11 A1      B2   C3    D2    12
12 A2      B3   C4    D1    13
13 A2      B3   C4    D1    14

Desired Output:


>>> print(associated_data)
{'A1': {'B1': {'C1': {'D1':[1 + 2],
                     {'D2':[3 + 4]},
               'C2': {'D1':[5 + 6],
                      'D2':[7 + 8]}}
       {'B2': {'C3': {'D1':[9 + 10],
                      'D2':[11 + 12]}}},
 'A2': {'B3': {'C4': {'D1':[13 + 14]}}}}

This might not be the best way to go about organizing and sorting the data, so I'm open to suggesting.

I've tried a method I was hoping would work by going through a ton of for loops like this:

# Main function
for market in df['Market'].unique():
    market_data = self.df.loc[self.df['Market'] == market]
    associated_reps = market_data['Rep'].unique()

    # Repeat
    for rep in associated_reps:
        rep_data = market_data.loc[market_data['Rep'] == rep]
        associated_names = rep_data['Name'].unique()

        # Repeat
        for name in associated_names:
            name_data = rep_data.loc[rep_data['Name'] == name]
            associated_dates = name_data['Date'].unique()

            # Repeat
            for date in associated_dates:
                date_data = name_data.loc[name_data['Date'] == date]
                associated_amount = sum(date_data['Amount'].tolist())

                # Attempted solution code (total fail)
                breakdown[market][rep][name][date] = associated_amount

This does appropriately separate all of the data, and at the end attempts to put it all together. I was hoping that you could make a super nested dict like this, but it failed completely (as it turns out that's just not how dicts work unfortunately lmao).

How can you produce the desired output to produce the same result (maybe with shorter sorting code as well)?

Thanks!

Zack Plauché
  • 3,307
  • 4
  • 18
  • 34
  • 3
    Do you actually want the string `['1 + 2']` or the result `[1 + 2]`, a.k.a. `[3]` in your ouput? Or are you trying to get a list of values `[1,2]`? Have you tried `groupby` and `to_dict` (or `to_json`)? – G. Anderson Jul 02 '19 at 16:41
  • @G.Anderson adding for example [1 + 2] together. In this context they're like 2 separate sales. I haven't tried any of those yet (didn't really know about them until now). – Zack Plauché Jul 02 '19 at 19:34

2 Answers2

2

There are similar questions posted, see here for e.g., But this solution below will work.

  1. Set indices for all "categories" in your data, these are the keys in your output dict.
  2. Aggregate on the index to remove duplicate indices
  3. Generate output dict.
import pprint
import numpy as np

def make_dict(ind_vals, d, v):
  """Accumulate index entries as keys in a dict."""
  p = d

  # Get handle on the last but one dict level and make nested dicts if they
  # are not present
  for ix in ind_vals[:-1]:
    # Replace with collection.OrderedDict if necessary.
    p = p.setdefault(ix, {})

  # Set the actual value of interest.
  p[ind_vals[-1]] = v

# Set indices correctly.
df = df.set_index(['Market', 'Rep', 'Name', 'Date'])

# Group values so we don't have duplicate indices
df = df.groupby(level=df.index.names).apply(np.sum)

dct = {}  # Replace with collection.OrderedDict if necessary.
for idx, val in df.iterrows():
  make_dict(idx, dct, val.Amount)
pprint.pprint(dct)
# {'A1': {'B1': {'C1': {'D1': 3, 'D2': 7}, 'C2': {'D1': 11, 'D2': 15}},
#         'B2': {'C3': {'D1': 19, 'D2': 23}}},
#  'A2': {'B3': {'C4': {'D1': 27}}}}
Karthik V
  • 1,867
  • 1
  • 16
  • 23
  • Thank you so much for posting this :) The issue I had with the other one that was shared as a possible duplicate was that there wasn't really a description of why what worked was working. Also, the data was much shorter and was in a list comprehension vs a dictionary. – Zack Plauché Jul 03 '19 at 04:29
1

Iterating through rows + values should work.

dict_values = {}
for idx, row in df.iterrows():
    A, B, C, D, Amount = row
    if A not in dict_values.keys():
        dict_values[A]={}
    if B not in dict_values[A].keys():
        dict_values[A][B]={}
    if C not in dict_values[A][B].keys():
        dict_values[A][B][C]={}
    if D not in dict_values[A][B][C].keys():
        dict_values[A][B][C][D]=[Amount]
    else:
        dict_values[A][B][C][D].append(Amount)
Yaakov Bressler
  • 9,056
  • 2
  • 45
  • 69