3

I have a Pandas DataFrame that looks like this:

import pandas as pd
print(pd.__version__)

df0 = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

Each dict in the column-2 lists has exactly the same keys (dst, object and admin).

There can be anywhere between 0 (an empty []) and 100 dicts in that list for each row of df0.

I wish I could expand the df0 DataFrame to look like this:

columns = ['id', 'name', 'dst', 'object', 'admin']

df_wanted
Out[416]: 
    id  name  dst  object                  admin
    12  None  925  'Lok. Certification'    'valid'
    12  None  935  'Lok. Administration'   'true'
    12  None  944  'Lok. Customer'         'false'
    12  None  945  'Lok. Customer'         'false'
    12  None  954  'Lok. Certification-C'  'invalid'
    12  None  956  'Lok. Certification'    'valid'
    13  wXB   986  'Fral_heater'           'valid'
    13  wXB   987  'Fral_cond.'            'valid'
    ...

Notice that the first two columns, id and name, are replicated down the rows to match the number of elements in each row's list.

(The dst column must be cast to an int using .astype(int) at the end.)
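
In other words, the very last step would be something along the lines of:

df_wanted['dst'] = df_wanted['dst'].astype(int)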

How could I achieve that?

Info:

Python 3.10.4
pandas 1.4.2
swiss_knight
    It looks like you could explode the lists then turn the dicts into columns. There are existing questions about that, like [How to unnest (explode) a column into multiple rows](/q/53218931/4518341) and [Split / Explode a column of dictionaries into separate columns](/q/38231591/4518341) (in this one, note that OP's data is JSON, not dicts per se, but the solutions are similar). – wjandrea May 08 '22 at 17:06

4 Answers

3

You can explode the column first, then convert the dictionaries to columns:

df0 = df0.explode(2, ignore_index=True)    
df0 = pd.concat([df0, df0[2].apply(pd.Series)], axis=1).drop(columns=2)
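
If you then want the column names and dtype from the question, a possible final step (assuming the column order produced above) is:

# Name the columns and cast dst to int, as requested in the question
df0.columns = ['id', 'name', 'dst', 'object', 'admin']
df0['dst'] = df0['dst'].astype(int)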
Rawson
  • KeyError: 'dst' – swiss_knight May 08 '22 at 17:05
  • Which line is the error in? If the second, can you try the individual parts of the line to see which bit is causing it, i.e. `df["list_of_dictionaries"].apply(pd.Series)`, ...? I can't try this with your data myself, as I don't have it. But with the three lines you have given, I am not finding any errors. – Rawson May 08 '22 at 17:11
  • Ah, I have changed `df` to `df0`. I had created my own dataframe and not noticed the variation in name. – Rawson May 08 '22 at 17:30
  • No worries, but then you should specify `df0[2]`, because the list of dicts is in the third column of the original DataFrame, and that column has no name, so a string key won't be recognized on that original DataFrame... – swiss_knight May 08 '22 at 17:31
  • Yes, changed that too. Hopefully everything works now. – Rawson May 08 '22 at 17:33
3

The other answers both work well. Instead of building Series objects row by row, you can build a single DataFrame object and join it back to the original. This should be a bit faster.

df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))

Output:

    0     1  dst                object    admin
0  12  None  925    Lok. Certification    valid
1  12  None  935   Lok. Administration     true
2  12  None  944         Lok. Customer    false
3  12  None  945         Lok. Customer    false
4  12  None  954  Lok. Certification-C  invalid
5  12  None  956    Lok. Certification    valid
6  13   wXB  986           Fral_heater    valid
7  13   wXB  987            Fral_cond.    valid

Benchmark:

>>> %timeit df1 = df0.explode(2, ignore_index=True); df1 = pd.concat([df1, df1[2].apply(pd.Series)], axis=1).drop(columns=2)
8.4 ms ± 842 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

>>> %timeit df1 = df0.explode(2, ignore_index=True).pipe(lambda x: x.join(pd.DataFrame(x.pop(2).tolist())))
4.8 ms ± 565 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
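
For readability, the piped one-liner can also be written in separate steps; this sketch should be equivalent:

exploded = df0.explode(2, ignore_index=True)       # one row per dict
expanded = pd.DataFrame(exploded.pop(2).tolist())  # dict keys become columns
df1 = exploded.join(expanded)                      # indexes align thanks to ignore_index=True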
2

I'd recommend expanding the column into a new df, then joining back to the main df:

# Copied from above
df = pd.DataFrame([
 [12, None, [{'dst': '925', 'object': 'Lok. Certification', 'admin': 'valid'},
             {'dst': '935', 'object': 'Lok. Administration', 'admin': 'true'},
             {'dst': '944', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '945', 'object': 'Lok. Customer', 'admin': 'false'},
             {'dst': '954', 'object': 'Lok. Certification-C', 'admin': 'invalid'},
             {'dst': '956', 'object': 'Lok. Certification', 'admin': 'valid'}]],
 [13,'wXB', [{'dst': '986', 'object': 'Fral_heater', 'admin': 'valid'},
             {'dst': '987', 'object': 'Fral_cond.', 'admin': 'valid'}]],
 ])

# Set the names of the columns
df.columns = ['id', 'name', 'object']

# Create a new df from the column
df_tmp = df['object'].explode().apply(pd.Series)

# Join to original
df = pd.concat([df[['id', 'name']], df_tmp], axis=1).reset_index(drop=True)

# Result:
|    |   id | name   |   dst | object               | admin   |
|---:|-----:|:-------|------:|:---------------------|:--------|
|  0 |   12 |        |   925 | Lok. Certification   | valid   |
|  1 |   12 |        |   935 | Lok. Administration  | true    |
|  2 |   12 |        |   944 | Lok. Customer        | false   |
|  3 |   12 |        |   945 | Lok. Customer        | false   |
|  4 |   12 |        |   954 | Lok. Certification-C | invalid |
|  5 |   12 |        |   956 | Lok. Certification   | valid   |
|  6 |   13 | wXB    |   986 | Fral_heater          | valid   |
|  7 |   13 | wXB    |   987 | Fral_cond.           | valid   |
Yaakov Bressler
0

Pull out column 2 and create a dataframe from it:

from itertools import chain

mix = df0.pop(2)
lengths = mix.str.len()
mix = pd.DataFrame(chain.from_iterable(mix))

Expand df0 with lengths:

df0 = df0.loc[df0.index.repeat(lengths)]
df0.index = range(len(df0))

Combine both dataframes:

pd.concat([df0, mix], axis = 1)

    0     1  dst                object    admin
0  12  None  925    Lok. Certification    valid
1  12  None  935   Lok. Administration     true
2  12  None  944         Lok. Customer    false
3  12  None  945         Lok. Customer    false
4  12  None  954  Lok. Certification-C  invalid
5  12  None  956    Lok. Certification    valid
6  13   wXB  986           Fral_heater    valid
7  13   wXB  987            Fral_cond.    valid

If you want more speed, you can drop down into NumPy and build everything as a dictionary before creating a new DataFrame (this is useful if you have large data and are concerned with memory usage or performance).
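
A rough sketch of that idea, starting again from the original df0 and assuming every dict has the dst, object and admin keys (as stated in the question):

import numpy as np
from itertools import chain

lists = df0[2]                                # column of lists of dicts
lengths = lists.str.len().to_numpy()          # number of dicts per row
flat = list(chain.from_iterable(lists))       # flatten into one list of dicts

out = pd.DataFrame({
    'id':     np.repeat(df0[0].to_numpy(), lengths),   # replicate id per dict
    'name':   np.repeat(df0[1].to_numpy(), lengths),   # replicate name per dict
    'dst':    np.array([d['dst'] for d in flat], dtype=int),
    'object': [d['object'] for d in flat],
    'admin':  [d['admin'] for d in flat],
})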

sammywemmy