I have the following Pandas DataFrame (abbreviated here):

import pandas as pd

df = pd.DataFrame(
    [
        ("Distal Lung AT2", 0.4269588779192778, 20),
        ("Lung Ciliated epithelial cells", 0.28642167657082035, 20),
        ("Distal Lung AT2", 0.4488207834077291, 15),
        ("Lung Ciliated epithelial cells", 0.27546336897259094, 15),
        ("Distal Lung AT2", 0.45502553604960105, 10),
        ("Lung Ciliated epithelial cells", 0.29080413886147555, 10),
        ("Distal Lung AT2", 0.48481604554028446, 5),
        ("Lung Ciliated epithelial cells", 0.3178232409599174, 5),
    ],
    columns=["features", "importance", "num_features"],
)

I'd like to create a stacked bar plot where the x-axis represents num_features (so rows with the same num_features are grouped into one bar), the y-axis represents importance, and each bar is made of blocks colored by features.

I tried using plotnine for this, as follows:

from plotnine import ggplot, aes, geom_bar, xlab, ylab

plot = (
    ggplot(df, aes(x="num_features", y="importance", fill="features"))
    + geom_bar(stat="identity")
    + xlab("Number of Features")
    + ylab("")
)

However, when I try to save the plot so I can view it, with ggsave(plot, os.path.join(figure_path, "stacked_feature_importances.png")), I get:

Traceback (most recent call last):
  File "/home/mdanb/plot_top_features_iteratively.py", line 94, in <module>
    plot_stacked_bar_plots(backwards_elim_dirs)
  File "/home/mdanb/plot_top_features_iteratively.py", line 87, in plot_stacked_bar_plots
    ggsave(plot, os.path.join(figure_path, "stacked_feature_importances.png"))
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/ggplot.py", line 736, in ggsave
    return plot.save(*arg, **kwargs)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/ggplot.py", line 724, in save
    fig, p = self.draw(return_ggplot=True)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/ggplot.py", line 203, in draw
    self._build()
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/ggplot.py", line 311, in _build
    layers.compute_position(layout)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/layer.py", line 79, in compute_position
    l.compute_position(layout)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/layer.py", line 393, in compute_position
    data = self.position.compute_layer(data, params, layout)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/positions/position.py", line 56, in compute_layer
    return groupby_apply(data, 'PANEL', fn)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/utils.py", line 638, in groupby_apply
    lst.append(func(d, *args, **kwargs))
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/positions/position.py", line 54, in fn
    return cls.compute_panel(pdata, scales, params)
  File "/home/mdanb/.local/lib/python3.8/site-packages/plotnine/positions/position_stack.py", line 85, in compute_panel
    trans = scales.y.trans
AttributeError: 'scale_y_discrete' object has no attribute 'trans'
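
A note on the last frame: plotnine appears to pick scale_y_discrete when the column mapped to y is not numeric, so the error suggests that importance in the real DataFrame holds strings or categories rather than floats. The abbreviated DataFrame above has floats and may not reproduce the error. A minimal sketch of how one might check and coerce the column; raw and fixed are hypothetical names, and the string-valued importance is an assumption about the real data, not something shown in the question:

```python
import pandas as pd

# Hypothetical reproduction: importance stored as strings.
# This is an assumption about the real data, not taken from the question.
raw = pd.DataFrame(
    [
        ("Distal Lung AT2", "0.4269588779192778", 20),
        ("Lung Ciliated epithelial cells", "0.28642167657082035", 20),
    ],
    columns=["features", "importance", "num_features"],
)

# Inspect the dtypes; a non-numeric importance column would be
# treated as discrete on the y-axis.
print(raw.dtypes)

# Coerce importance to a numeric dtype before plotting.
fixed = raw.assign(importance=pd.to_numeric(raw["importance"]))
print(fixed.dtypes)
```

If the dtype check shows importance is already numeric, this is not the cause and the coercion is unnecessary.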

I also looked into using Pandas directly, without plotnine, based on this post. However, it doesn't quite address my issue, because the bar plot there is stacked by counts, whereas I specifically want to stack by the values of a column (importance).
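
For the pandas-only route, stacking by the values of a column (rather than by counts) can probably be done by reshaping to wide form first. The following is a sketch under that assumption, not the approach from the linked post: wide and plot_stacked are hypothetical names, and the plotting helper needs matplotlib installed.

```python
import pandas as pd

df = pd.DataFrame(
    [
        ("Distal Lung AT2", 0.4269588779192778, 20),
        ("Lung Ciliated epithelial cells", 0.28642167657082035, 20),
        ("Distal Lung AT2", 0.4488207834077291, 15),
        ("Lung Ciliated epithelial cells", 0.27546336897259094, 15),
        ("Distal Lung AT2", 0.45502553604960105, 10),
        ("Lung Ciliated epithelial cells", 0.29080413886147555, 10),
        ("Distal Lung AT2", 0.48481604554028446, 5),
        ("Lung Ciliated epithelial cells", 0.3178232409599174, 5),
    ],
    columns=["features", "importance", "num_features"],
)

# One row per num_features, one column per feature; cells hold importance.
wide = df.pivot(
    index="num_features", columns="features", values="importance"
).sort_index()


def plot_stacked(wide_df):
    """Draw one stacked bar per row of wide_df (requires matplotlib)."""
    ax = wide_df.plot(kind="bar", stacked=True)
    ax.set_xlabel("Number of Features")
    ax.set_ylabel("")
    return ax
```

Calling plot_stacked(wide) returns a matplotlib Axes, and ax.figure.savefig(...) can then write the image to disk.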

An Ignorant Wanderer
    A stacked plot doesn't make sense in your situation: it only applies if you are aiming to display a total column and split it into parts. For example, if you wanted the sum of the importances, divided according to the feature label, then a stacked bar plot would apply here. I think what you are looking for is multiple side-by-side bar plots with the corresponding labels. – INGl0R1AM0R1 Jul 20 '22 at 17:37

1 Answer

The problem is that you are using geom_bar, which by default doesn't expect a y aesthetic: it computes the counts for you from the x aesthetic you specify.

If you want to specify y manually, you should use geom_col, which expects both an x and a y aesthetic. If you include a fill aesthetic, the default behaviour is to stack the columns, which you can change by specifying position='dodge'.

Using your example:

import plotnine as p9

(p9.ggplot(df)
 + p9.aes(x='num_features', y='importance', fill='features')
 + p9.geom_col())

Output: [figure not preserved: stacked bar plot of importance against num_features, filled by features]