29

I have an API that returns a single row of data as a Python dictionary. Most of the keys have a single value, but some of the keys have values that are lists (or even lists-of-lists or lists-of-dictionaries).

When I throw the dictionary into pd.DataFrame to try to convert it to a pandas DataFrame, it throws a "Arrays must be the same length" error. This is because it cannot process the keys which have multiple values (i.e. the keys which have values of lists).

How do I get pandas to treat the lists as 'single values'?

As a hypothetical example:

data = { 'building': 'White House', 'DC?': True,
         'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }

I want to turn it into a DataFrame like this:

ix   building         DC?      occupants
0    'White House'    True     ['Barack', 'Michelle', 'Sasha', 'Malia']
Mr_and_Mrs_D
  • 32,208
  • 39
  • 178
  • 361
Conway
  • 401
  • 1
  • 4
  • 6
  • do you know beforehand the structure of the data coming in? – AbtPst Nov 03 '15 at 16:46
  • 1
    In general, yes. In the hypothetical example, the 'building' will always be a single string, and the 'DC?' will always be a single boolean. But the length of the 'occupants' list may change depending on the building that is queried. Does that answer your question? – Conway Nov 03 '15 at 17:01
  • correct, so i was thinking, create a blank dataframe first and then keep adding rows to it. however, note that as Andy points out, this may be ineffcient – AbtPst Nov 03 '15 at 17:03

6 Answers6

33

This works if you pass a list (of rows):

In [11]: pd.DataFrame(data)
Out[11]:
    DC?     building occupants
0  True  White House    Barack
1  True  White House  Michelle
2  True  White House     Sasha
3  True  White House     Malia

In [12]: pd.DataFrame([data])
Out[12]:
    DC?     building                         occupants
0  True  White House  [Barack, Michelle, Sasha, Malia]
Andy Hayden
  • 359,921
  • 101
  • 625
  • 535
  • 1
    This solution also works with lists-of-lists and lists-of-dictionaries. `data = { 'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] , 'list_of_lists': [[1,2,3], [4,5,6]], 'list_of_dicts': [{'a': 1, 'b': 2}, {'c': 3, 'd': 4}]}` – Alexander Nov 03 '15 at 17:10
  • It did not work for me where I have 3 lists of the same length for each dictionary key.```nutrients_totals_dict = { "Name" : cnf_nutrient_names_all, "Value": ["0" for i in range(num_values)], "Units": cnf_nutrient_units_all }``` – mLstudent33 Nov 07 '20 at 05:29
6

This turns out to be very trivial in the end

data = { 'building': 'White House', 'DC?': True, 'occupants': ['Barack', 'Michelle', 'Sasha', 'Malia'] }
df = pandas.DataFrame([data])
print df

Which results in:

    DC?     building                         occupants
0  True  White House  [Barack, Michelle, Sasha, Malia]
Chinmay Kanchi
  • 62,729
  • 22
  • 87
  • 114
  • This works for python 3.x. The pandas from_dict behaviour seems to have changed. – BenP Dec 27 '18 at 12:29
  • Hi @ChinmayKanchi, what if there are more than one lines of such data, eg. like 2nd house with another list of [names x multiple] ? – Xavier Sun Jun 02 '20 at 02:27
2

Solution to make dataframe from dictionary of lists where keys become a sorted index and column names are provided. Good for creating dataframes from scraped html tables.

d = { 'B':[10,11], 'A':[20,21] }
df = pd.DataFrame(d.values(),columns=['C1','C2'],index=d.keys()).sort_index()
df

    C1  C2
A   20  21
B   10  11
BSalita
  • 8,420
  • 10
  • 51
  • 68
1

Would it be acceptable if instead of having one entry with a list of occupants, you had individual entries for each occupant? If so you could just do

n = len(data['occupants'])
for key, val in data.items():
    if key != 'occupants':
        data[key] = n*[val]

EDIT: Actually, I'm getting this behavior in pandas (i.e. just with pd.DataFrame(data)) even without this pre-processing. What version are you using?

Tommy Kahn
  • 151
  • 6
0

I had a closely related problem, but my data structure was a multi-level dictionary with lists in the second level dictionary:

result = {'hamster': {'confidence': 1, 'ids': ['id1', 'id2']},
          'zombie': {'confidence': 1, 'ids': ['id3']}}

When importing this with pd.DataFrame([result]), I end up with columns named hamster and zombie. The (for me) correct import would be to have these as row titles, and confidence and ids as column titles. To achieve this, I used pd.DataFrame.from_dict:

In [42]: pd.DataFrame.from_dict(result, orient="index")
Out[42]:
         confidence         ids
hamster           1  [id1, id2]
zombie            1       [id3]

This works for me with python 3.8 + pandas 1.2.3.

Thor
  • 373
  • 1
  • 2
  • 7
-1

if you know the keys of the dictionary beforehand, why not first create an empty data frame and then keep adding rows?

AbtPst
  • 7,778
  • 17
  • 91
  • 172
  • 2
    This is not very efficient as pandas creates a new copy for every insertion (so building up a DataFrame in this way is O(n^2) in number of rows). – Andy Hayden Nov 03 '15 at 16:52