How to create box plots from columns of dicts in pandas

Question

I have a dataframe in which each cell of the model columns is a dictionary, and I'd like to plot it with seaborn's horizontal box plot.
1. The x axis should be the float values for each 'dialog'
2. The y axis should show the 4 different models
3. There should be a plot for each part of speech, meaning there should be one graph for 'INTJ', another for 'ADV', and so on.
I'm thinking I'll have to do a pd.melt first to restructure the data so that the new columns would be 'dialog_num', 'model_type', and 'value' (the automatic variable name after a melt, which here holds the rows of dictionaries).
After that, perhaps break up the 'value' column so that each column is a part of speech ('ADV', 'INTJ', 'VERB', etc.) (this part seems tricky to me). Past this point, do a for loop over all of the columns and apply the horizontal boxplot?

import pandas as pd

pos =\
{'dialog_num': {0: 0, 1: 1, 2: 2},
 'model1': {0: {'ADV': 0.072, 'INTJ': 0.03, 'PRON': 0.133, 'VERB': 0.109},
            1: {'ADJ': 0.03, 'NOUN': 0.2, 'PRON': 0.13},
            2: {'ADV': 0.083, 'PRON': 0.125, 'VERB': 0.0625}},
 'model2': {0: {'ADJ': 0.1428, 'ADV': 0.1428, 'AUX': 0.1428, 'INTJ': 0.285},
            1: {'ADJ': 0.1, 'DET': 0.1, 'NOUN': 0.1, 'PROPN': 0.1, 'VERB': 0.2},
            2: {'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166, 'VERB': 0.3333}},
 'model3': {0: {'ADJ': 0.06, 'CCONJ': 0.06, 'NOUN': 0.2, 'PRON': 0.266, 'SPACE': 0.066, 'VERB': 0.333},
            1: {'AUX': 0.15, 'PRON': 0.25, 'PUNCT': 0.15, 'VERB': 0.15},
            2: {'ADP': 0.125, 'PRON': 0.0625, 'PUNCT': 0.0625, 'VERB': 0.25}},
 'model4': {0: {'ADJ': 0.25, 'ADV': 0.08, 'CCONJ': 0.083, 'PRON': 0.166},
            1: {'AUX': 0.33, 'PRON': 0.2, 'VERB': 0.0667},
            2: {'CCONJ': 0.125, 'NOUN': 0.125, 'PART': 0.125, 'PRON': 0.125, 'SPACE': 0.125, 'VERB': 0.375}}}
df = pd.DataFrame.from_dict(pos)

display(df)
   dialog_num                                                      model1                                                            model2                                                                                   model3                                                                                        model4
0           0  {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}      {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}  {'PRON': 0.266, 'VERB': 0.333, 'ADJ': 0.06, 'NOUN': 0.2, 'CCONJ': 0.06, 'SPACE': 0.066}                                     {'PRON': 0.166, 'ADV': 0.08, 'ADJ': 0.25, 'CCONJ': 0.083}
1           1                    {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}  {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}                                 {'PRON': 0.25, 'AUX': 0.15, 'VERB': 0.15, 'PUNCT': 0.15}                                                    {'PRON': 0.2, 'AUX': 0.33, 'VERB': 0.0667}
2           2               {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}   {'VERB': 0.3333, 'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166}                            {'PRON': 0.0625, 'VERB': 0.25, 'PUNCT': 0.0625, 'ADP': 0.125}  {'PRON': 0.125, 'VERB': 0.375, 'PART': 0.125, 'CCONJ': 0.125, 'NOUN': 0.125, 'SPACE': 0.125}

Trenton McKinney · Accepted Answer · 2022-05-27T19:04:30.760

sns.boxplot expects data to be supplied in a long form when specifying x= and y=.
In this case, based on the specifications of having each speech type as a separate plot, sns.catplot will be used because there is a col= parameter, which can be used to create separate plots for speech types.

1. As mentioned in the OP, use .melt to unpivot the wide dataframe.
2. Use pd.json_normalize to convert the 'value' column (dict type) into a flat table.
   - See "Split / Explode a column of dictionaries into separate columns with pandas" if there are issues with this step.
3. Join the flattened table (vals) to dfm with .join.
   - This works because vals and dfm have matching indices.
4. .melt the dataframe again.
5. Plot the box plot from the long form dataframe.

Tested in python 3.10, pandas 1.4.2, matplotlib 3.5.1, seaborn 0.11.2

import pandas as pd
import seaborn as sns

# load the dict into a dataframe
df = pd.DataFrame(pos)

# unpivot the dataframe
dfm = df.melt(id_vars='dialog_num', var_name='model')

# convert the 'value' column of dicts to a flat table
vals = pd.json_normalize(dfm['value'])

# combine vals to dfm, without the 'value' column
dfm = dfm.iloc[:, 0:-1].join(vals)

# unpivot the dataframe again
dfm = dfm.melt(id_vars=['dialog_num', 'model'])
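
The .join step relies on vals and dfm sharing the same index. If the melted frame has been filtered or reordered first, that may no longer hold. The following is a minimal sketch of one way to handle that case, not part of the tested answer above: it uses an abbreviated copy of the question's data (two models only), and `subset` and `joined` are names made up for the illustration.

```python
import pandas as pd

# abbreviated copy of the question's data (two models, three dialogs)
pos = {
    'dialog_num': {0: 0, 1: 1, 2: 2},
    'model1': {0: {'ADV': 0.072, 'INTJ': 0.03, 'PRON': 0.133, 'VERB': 0.109},
               1: {'ADJ': 0.03, 'NOUN': 0.2, 'PRON': 0.13},
               2: {'ADV': 0.083, 'PRON': 0.125, 'VERB': 0.0625}},
    'model2': {0: {'ADJ': 0.1428, 'ADV': 0.1428, 'AUX': 0.1428, 'INTJ': 0.285},
               1: {'ADJ': 0.1, 'DET': 0.1, 'NOUN': 0.1, 'PROPN': 0.1, 'VERB': 0.2},
               2: {'CCONJ': 0.166, 'NOUN': 0.333, 'SPACE': 0.166, 'VERB': 0.3333}},
}
df = pd.DataFrame(pos)
dfm = df.melt(id_vars='dialog_num', var_name='model')

# simulate a frame whose index is no longer 0..n-1 (here: dialog 1 removed)
subset = dfm[dfm['dialog_num'] != 1]

# flatten the dicts from a plain list, which gives a fresh 0..n-1 index
vals = pd.json_normalize(subset['value'].tolist())

# copy the original index onto vals so that .join lines the rows up
vals.index = subset.index

# same join as in the answer, dropping the 'value' column by name
joined = subset.drop(columns='value').join(vals)
print(joined)
```

An alternative with the same effect would be to call `subset.reset_index(drop=True)` before flattening, so that both frames start from a default index.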

plot all of the speech types together

p = sns.boxplot(data=dfm, x='value', y='model')

plot speech types separately

Most speech types have only a single value per model, or none at all, so many of the boxes are drawn as a single line or are missing.

p = sns.catplot(kind='box', data=dfm, x='value', y='model', col='variable', col_wrap=4, height=4)
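
Because so many speech types have few observations, it can help to count the values before plotting and, if wanted, keep only the speech types with enough data. This is a rough sketch under assumptions: the small frame below mimics the final long-form dfm using the question's 'ADV' and 'INTJ' values for model1 and model2, and `min_obs` is a hypothetical threshold chosen for illustration, not something from the answer.

```python
import pandas as pd

# small long-form frame with the same columns as the final dfm;
# values are the question's ADV and INTJ entries for model1 and model2
dfm = pd.DataFrame({
    'dialog_num': [0, 1, 2] * 4,
    'model': ['model1'] * 3 + ['model2'] * 3 + ['model1'] * 3 + ['model2'] * 3,
    'variable': ['ADV'] * 6 + ['INTJ'] * 6,
    'value': [0.072, None, 0.083, 0.1428, None, None,
              0.03, None, None, 0.285, None, None],
})

# drop the rows where a speech type did not occur in a dialog
observed = dfm.dropna(subset=['value'])

# number of observed values per speech type and model
counts = observed.groupby(['variable', 'model'])['value'].count()
print(counts)

# hypothetical threshold: keep speech types with at least min_obs values overall
min_obs = 3
n_per_type = observed.groupby('variable')['value'].transform('count')
enough = observed[n_per_type >= min_obs]
print(enough)
```

The filtered frame (`enough` here) can then be passed as `data=` to the same sns.catplot call, which should produce one panel per remaining speech type.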

DataFrames at each step

1: `dfm.head()`

   dialog_num   model                                                             value
0           0  model1        {'INTJ': 0.03, 'ADV': 0.072, 'PRON': 0.133, 'VERB': 0.109}
1           1  model1                          {'PRON': 0.13, 'ADJ': 0.03, 'NOUN': 0.2}
2           2  model1                     {'PRON': 0.125, 'ADV': 0.083, 'VERB': 0.0625}
3           0  model2      {'INTJ': 0.285, 'AUX': 0.1428, 'ADV': 0.1428, 'ADJ': 0.1428}
4           1  model2  {'PROPN': 0.1, 'VERB': 0.2, 'DET': 0.1, 'ADJ': 0.1, 'NOUN': 0.1}

2: `vals.head()`

    INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
0  0.030  0.0720  0.133  0.1090     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
1    NaN     NaN  0.130     NaN  0.0300   0.2     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
2    NaN  0.0830  0.125  0.0625     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
3  0.285  0.1428    NaN     NaN  0.1428   NaN  0.1428    NaN  NaN    NaN    NaN    NaN  NaN   NaN
4    NaN     NaN    NaN  0.2000  0.1000   0.1     NaN    0.1  0.1    NaN    NaN    NaN  NaN   NaN

3: `dfm.head()`

   dialog_num   model   INTJ     ADV   PRON    VERB     ADJ  NOUN     AUX  PROPN  DET  CCONJ  SPACE  PUNCT  ADP  PART
0           0  model1  0.030  0.0720  0.133  0.1090     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
1           1  model1    NaN     NaN  0.130     NaN  0.0300   0.2     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
2           2  model1    NaN  0.0830  0.125  0.0625     NaN   NaN     NaN    NaN  NaN    NaN    NaN    NaN  NaN   NaN
3           0  model2  0.285  0.1428    NaN     NaN  0.1428   NaN  0.1428    NaN  NaN    NaN    NaN    NaN  NaN   NaN
4           1  model2    NaN     NaN    NaN  0.2000  0.1000   0.1     NaN    0.1  0.1    NaN    NaN    NaN  NaN   NaN

4: `dfm.head()`

   dialog_num   model variable  value
0           0  model1     INTJ  0.030
1           1  model1     INTJ    NaN
2           2  model1     INTJ    NaN
3           0  model2     INTJ  0.285
4           1  model2     INTJ    NaN
