
I create an OrderedDict:

from collections import OrderedDict

od = OrderedDict([((2, 9), 0.5218),
  ((2, 0), 0.3647),
  ((3, 15), 0.3640),
  ((3, 8), 0.3323),
  ((2, 28), 0.3310),
  ((2, 15), 0.3281),
  ((2, 10), 0.2938),
  ((3, 9), 0.2719)])

Then I feed that into the pandas DataFrame constructor:

import pandas as pd

df = pd.DataFrame({'values': od})

The result is this:

[Image: the resulting DataFrame]

Instead, I expected this:

[Image: the expected DataFrame]

What is going on here that I don't understand?

P.S.: I am not looking for an alternative way of solving the problem (though you are welcome to post one if you think it would help the community). All I want is to understand why this doesn't work. Is it a bug, or is there some logic to it? This is also not a duplicate of this link, because I am specifically using an OrderedDict and not a normal dict.

Jim
  • Reading the source code, [init_dict](https://github.com/pandas-dev/pandas/blob/master/pandas/core/frame.py#L460) does not modify the order of the arrays being passed. Among many other checks that don't apply to your case, it extracts the column names from the dictionary keys. The constructor then calls [NDFrame.__init__](https://github.com/pandas-dev/pandas/blob/dca6c7f43d113b4aca1e82094e2af0d82612abed/pandas/core/generic.py#L166), in case that helps anyone who wants to pick up the research from that point. – RichieV Sep 27 '20 at 20:44

1 Answer


If you want the DataFrame rows in the same order as your dictionary, you can pass the values and the keys separately:

df = pd.DataFrame(od.values(), index=od.keys(), columns=['values'])

Output

      values
2 9   0.5218
  0   0.3647
3 15  0.3640
  8   0.3323
2 28  0.3310
  15  0.3281
  10  0.2938
3 9   0.2719
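
As a minimal sketch of the same idea, the keys and values can first be materialised as plain lists and the index built explicitly with `pd.MultiIndex.from_tuples`. The name `df_explicit` is only illustrative. Because the index is supplied directly, the constructor has no ordering decision to make, so the check at the end should hold on any pandas version.

```python
from collections import OrderedDict

import pandas as pd

od = OrderedDict([((2, 9), 0.5218), ((2, 0), 0.3647), ((3, 15), 0.3640),
                  ((3, 8), 0.3323), ((2, 28), 0.3310), ((2, 15), 0.3281),
                  ((2, 10), 0.2938), ((3, 9), 0.2719)])

# Turn the dict views into plain lists so pandas receives ordinary sequences.
index = pd.MultiIndex.from_tuples(list(od.keys()))
df_explicit = pd.DataFrame({'values': list(od.values())}, index=index)

# The row order follows the insertion order of the OrderedDict.
print(df_explicit.index.tolist() == list(od.keys()))
```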

The only mention of OrderedDict in the frame source code is in an example for df.to_dict(), so it is not useful here.

It seems that even though you are passing an ordered structure, pandas re-orders the rows by default once the OrderedDict is wrapped in a plain dictionary, {'values': od}, and the index is taken from the OrderedDict's keys.

This behavior seems to be overridden if you build your dictionary with the column labels included as well (à la JSON):

od = OrderedDict([
    ((2, 9), {'values':0.5218}),
    ((2, 0), {'values':0.3647}),
    ((3, 15), {'values':0.3640}),
    ((3, 8), {'values':0.3323}),
    ((2, 28), {'values':0.3310}),
    ((2, 15), {'values':0.3281}),
    ((2, 10), {'values':0.2938}),
    ((3, 9), {'values':0.2719})
])
df = pd.DataFrame(od).T
print(df)

Output

      values
2 9   0.5218
  0   0.3647
3 15  0.3640
  8   0.3323
2 28  0.3310
  15  0.3281
  10  0.2938
3 9   0.2719
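
To see what the constructor actually does with `{'values': od}` on a particular installation, a small diagnostic like the one below may help. It is only a sketch: it prints the pandas version and reports whether the rows came back in insertion order or in sorted order, without assuming either outcome, since the behavior may differ between pandas versions. The variable names `keys` and `rows` are illustrative.

```python
from collections import OrderedDict

import pandas as pd

od = OrderedDict([((2, 9), 0.5218), ((2, 0), 0.3647), ((3, 15), 0.3640),
                  ((3, 8), 0.3323), ((2, 28), 0.3310), ((2, 15), 0.3281),
                  ((2, 10), 0.2938), ((3, 9), 0.2719)])

# Same call as in the question: the OrderedDict is nested in a plain dict.
df = pd.DataFrame({'values': od})

keys = list(od.keys())
rows = df.index.tolist()

print('pandas version:      ', pd.__version__)
print('insertion order kept:', rows == keys)
print('sorted by index:     ', rows == sorted(keys))
```

Whichever order is reported, each value stays attached to its own key; only the row order is affected.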
RichieV
  • As to why this happens, you would have to follow the parsing classes in that source code. Hopefully someone has already done that and can enlighten us. – RichieV Sep 27 '20 at 19:35
  • What do you mean by wrap it in a common dictionary? – Jim Sep 27 '20 at 19:57
  • Is there a way to not wrap it in a common dictionary? Something like `df = pd.DataFrame(data=od, columns=['values'])` (which doesn't work) gives `TypeError: Expected tuple, got str`. – Jim Sep 27 '20 at 20:03
  • This here doesn't work either: `df = pd.DataFrame(od)`. It gives the error: `ValueError: If using all scalar values, you must pass an index` – Jim Sep 27 '20 at 20:04
  • `{'values': od}` is a plain dictionary, which has only one item with key `'values'` and value `od` – RichieV Sep 27 '20 at 20:06
  • Actually, starting with Python 3.7, normal dicts are ordered too. "Python 3.7 elevates this implementation detail to a language specification, so it is now mandatory that dict preserves order in all Python implementations compatible with that version or newer" https://stackoverflow.com/questions/1867861/how-to-keep-keys-values-in-same-order-as-declared – Jim Sep 27 '20 at 20:48
  • And it is not as if pandas gives back the DataFrame in a random row order. It sorts it. So I don't believe that dict vs. OrderedDict matters. It just sorts it, period, for whatever reason. Maybe it's somewhere in the specification, but I can't find it. Pandas just behaves so erratically sometimes. It's frustrating. – Jim Sep 27 '20 at 20:50
  • I actually think it is quite reasonable to sort the index, since that is the default behavior in many methods like `unstack`, `reindex` and others. If it helps, `init_dict` does not sort the data, so the sorting must happen further down the stack. But I wouldn't worry about it and would just use whatever works, unless you want to learn the actual pandas implementation for your own benefit, which is not a bad idea if you have the time. – RichieV Sep 27 '20 at 20:55
  • In my experience, `reindex` doesn't sort; I mean, what would be the point of reindexing then? In fact, I used `reindex` to sort my DataFrame rows according to a metric that I had computed before, and it worked. – Jim Sep 27 '20 at 21:00
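
The last comment's point about `reindex` can be sketched as follows: whatever row order the constructor produces, reindexing with the OrderedDict's keys should bring the rows back into insertion order. This is an illustrative sketch, not part of the original answer, and the names `target` and `restored` are made up for the example.

```python
from collections import OrderedDict

import pandas as pd

od = OrderedDict([((2, 9), 0.5218), ((2, 0), 0.3647), ((3, 15), 0.3640),
                  ((3, 8), 0.3323), ((2, 28), 0.3310), ((2, 15), 0.3281),
                  ((2, 10), 0.2938), ((3, 9), 0.2719)])

# Build the frame as in the question; the row order may or may not be sorted.
df = pd.DataFrame({'values': od})

# Reindex against the keys in insertion order. reindex follows the order of
# the target index it is given rather than sorting.
target = pd.MultiIndex.from_tuples(list(od.keys()))
restored = df.reindex(target)

print(restored.index.tolist() == list(od.keys()))
```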