100

Say I have a dictionary that looks like this:

dictionary = {'A' : {'a': [1,2,3,4,5],
                     'b': [6,7,8,9,1]},

              'B' : {'a': [2,3,4,5,6],
                     'b': [7,8,9,1,2]}}

and I want a dataframe that looks something like this:

     A   B
     a b a b
  0  1 6 2 7
  1  2 7 3 8
  2  3 8 4 9
  3  4 9 5 1
  4  5 1 6 2

Is there a convenient way to do this? If I try:

In [99]:

DataFrame(dictionary)

Out[99]:
     A               B
a   [1, 2, 3, 4, 5] [2, 3, 4, 5, 6]
b   [6, 7, 8, 9, 1] [7, 8, 9, 1, 2]

I get a dataframe where each element is a list. What I need is a multiindex where each level corresponds to the keys in the nested dict and the rows corresponding to each element in the list as shown above. I think I can work a very crude solution but I'm hoping there might be something a bit simpler.

pbreach
  • 16,049
  • 27
  • 82
  • 120

6 Answers6

104

Pandas wants the MultiIndex values as tuples, not nested dicts. The simplest thing is to convert your dictionary to the right format before trying to pass it to DataFrame:

>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
{('A', 'a'): [1, 2, 3, 4, 5],
 ('A', 'b'): [6, 7, 8, 9, 1],
 ('B', 'a'): [2, 3, 4, 5, 6],
 ('B', 'b'): [7, 8, 9, 1, 2]}
>>> pandas.DataFrame(reform)
   A     B   
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  2

[5 rows x 4 columns]
BrenBarn
  • 242,874
  • 37
  • 412
  • 384
  • A little tricky to wrap the head around but this is exactly what I was hoping for. – pbreach Jul 28 '14 at 04:12
  • 2
    +1, but this does not work if the length of the dictionary value lists, e.g. 5 in this specific example, do not match. Any ideas how I can tackle this? For instance if `reform= ('A', 'a'): [1, 2, 3, 4, 5], ('A', 'b'): [6, 7, 8, 9,]` – Zhubarb Oct 22 '15 at 14:16
  • 3
    @Zhubarb: What do you expect the resulting DataFrame to look like? A DataFrame has to be rectangular; it can't have columns of different lengths. – BrenBarn Oct 22 '15 at 16:45
  • You are right, I initiall thought of taking the max list length as the row number and padding the lists that fell short, but it is just a crude work-around. Does not make much sense – Zhubarb Oct 22 '15 at 17:09
  • 1
    The columns are randomly ordered in the dataframe, since the dict is unordered. How can the desired order be enforced, short of using an OrderedDict? – Dave Kielpinski Nov 12 '16 at 01:23
  • @DaveKielpinski: You can't enforce the order in the dict. Just get the columns in the order you want after creating the DataFrame, e.g., `DataFrame(...)[["A", "B"]]`. – BrenBarn Nov 12 '16 at 02:03
  • @BrenBarn This gets very nasty when there are multiple labels at each level. – Dave Kielpinski Nov 12 '16 at 02:13
  • @DaveKielpinski You can't enforce the order in the dict. If you want to enforce the order, don't use a dict. Put the data into a list-based format instead, for instance. If you just want to *sort* the columns you can ise `.sort(axis=1)`, but if you have a specific order you need you'll have to do something else. In any case, if you have a question about this, you should ask it as a separate question. – BrenBarn Nov 12 '16 at 02:14
  • @BrenBarn The original question has dataframe column labels in the same order as the presented data, so I think this is relevant. However, the OP accepted your answer so you obviously addressed the OP's intent. – Dave Kielpinski Nov 14 '16 at 19:34
  • 3
    This is great. FYI, this can also be done with `pd.DataFrame.from_dict` if `values` is is in the form of "records": `[{'a': 1, 'b': 2}, {'a': 3, 'b': 4}, ...]` – EliadL Feb 03 '19 at 09:18
35

You're looking for the functionality in .stack:

df = pandas.DataFrame.from_dict(dictionary, orient="index").stack().to_frame()
# to break out the lists into columns
df = pandas.DataFrame(df[0].values.tolist(), index=df.index)
Vira
  • 509
  • 5
  • 8
26
dict_of_df = {k: pd.DataFrame(v) for k,v in dictionary.items()}
df = pd.concat(dict_of_df, axis=1)

Note that the order of columns is lost for python < 3.6

user8227892
  • 261
  • 3
  • 3
  • I am having another problem using this method. from yahoofinancials import YahooFinancials tickers = ['AAPL', 'WFC', 'F', 'FB', 'DELL', 'SNE'] yahoo_financials = YahooFinancials(tickers) BB=yahoo_financials.get_key_statistics_data() dict_of_df = {k: pd.DataFrame(v) for k,v in BB.items()} df = pd.concat(dict_of_df, axis=1) ValueError: If using all scalar values, you must pass an index – rsc05 Nov 09 '19 at 12:12
2

This recursive function should work:

def reform_dict(dictionary, t=tuple(), reform={}):
    for key, val in dictionary.items():
        t = t + (key,)
        if isinstance(val, dict):
            reform_dict(val, t, reform)
        else:
            reform.update({t: val})
        t = t[:-1]
    return reform
madsentail
  • 29
  • 1
2

If lists in the dictionary are not of the same lenght, you can adapte the method of BrenBarn.

>>> dictionary = {'A' : {'a': [1,2,3,4,5],
                         'b': [6,7,8,9,1]},
                 'B' : {'a': [2,3,4,5,6],
                        'b': [7,8,9,1]}}

>>> reform = {(outerKey, innerKey): values for outerKey, innerDict in dictionary.items() for innerKey, values in innerDict.items()}
>>> reform
 {('A', 'a'): [1, 2, 3, 4, 5],
  ('A', 'b'): [6, 7, 8, 9, 1],
  ('B', 'a'): [2, 3, 4, 5, 6],
  ('B', 'b'): [7, 8, 9, 1]}

>>> pandas.DataFrame.from_dict(reform, orient='index').transpose()
>>> df.columns = pd.MultiIndex.from_tuples(df.columns)
   A     B   
   a  b  a  b
0  1  6  2  7
1  2  7  3  8
2  3  8  4  9
3  4  9  5  1
4  5  1  6  NaN
[5 rows x 4 columns]
Dimitri
  • 29
  • 2
0

This solution works for a larger dataframe, it fits what was requested

cols = df.columns
int_cols = len(cols)
col_subset_1 = [cols[x] for x in range(1,int(int_cols/2)+1)]
col_subset_2 = [cols[x] for x in range(int(int_cols/2)+1, int_cols)]

col_subset_1_label = list(zip(['A']*len(col_subset_1), col_subset_1))
col_subset_2_label = list(zip(['B']*len(col_subset_2), col_subset_2))
df.columns = pd.MultiIndex.from_tuples([('','myIndex'),*col_subset_1_label,*col_subset_2_label])

OUTPUT

                        A                      B
     myIndex    a              b          c          d
0   0.159710    1.472925    0.619508    -0.476738   0.866238
1   -0.665062   0.609273    -0.089719   0.730012    0.751615
2   0.215350    -0.403239   1.801829    -2.052797   -1.026114
3   -0.609692   1.163072    -1.007984   -0.324902   -1.624007
4   0.791321    -0.060026   -1.328531   -0.498092   0.559837
5   0.247412    -0.841714   0.354314    0.506985    0.425254
6   0.443535    1.037502    -0.433115   0.601754    -1.405284
7   -0.433744   1.514892    1.963495    -2.353169   1.285580
Rafael MR
  • 193
  • 1
  • 15