I want to plot a plotly.express.area diagram with px.area(df, x="date", y="count_cum", color="topic")

This is what the df looks like:

           date                        topic  count_cum  count
0    2021-03-05                      topic_1          1      1
1    2021-03-05                      topic_2          1      1
2    2021-03-06                      topic_1          2      1
3    2021-03-07                      topic_1          3      1
4    2021-03-07                      topic_2          2      1

The problem is that topic_2 has no row for 2021-03-06, so its cumulative value is missing and the topic_2 area drops to 0 on that day. Is there a way to prevent this? If not, how can I adjust the df?

I think the target should look like this if plotly can't do it:

           date                        topic  count_cum  count
0    2021-03-05                      topic_1          1      1
1    2021-03-05                      topic_2          1      1
2    2021-03-06                      topic_1          2      1
3    2021-03-06                      topic_2          1      0
4    2021-03-07                      topic_1          3      1
5    2021-03-07                      topic_2          2      1
Berni
  • Please elaborate a bit on your dataset here. You're saying that `topic_2` is ***not*** present in your real world dataset? Then why ***is*** it present in your data sample? Am I missing something here? It would be a lot easier to provide fruitful suggestions if you would share a sample of your dataset as described [here](https://stackoverflow.com/questions/63163251/pandas-how-to-easily-share-a-sample-dataframe-using-df-to-dict/63163254#63163254) – vestland Sep 20 '21 at 22:05

1 Answer

You can fill in missing (NaN) values by interpolation with DataFrame.interpolate(), e.g.:

import numpy as np
import pandas as pd

df = pd.DataFrame([(0.0, np.nan, -1.0, 1.0),
                   (np.nan, 2.0, np.nan, np.nan),
                   (2.0, 3.0, np.nan, 9.0),
                   (np.nan, 4.0, -4.0, 16.0)],
                  columns=list('abcd'))

df
     a    b    c     d
0  0.0  NaN -1.0   1.0
1  NaN  2.0  NaN   NaN
2  2.0  3.0  NaN   9.0
3  NaN  4.0 -4.0  16.0

df.interpolate(method='linear', limit_direction='forward', axis=0)

     a    b    c     d
0  0.0  NaN -1.0   1.0
1  1.0  2.0 -2.0   5.0
2  2.0  3.0 -3.0   9.0
3  2.0  4.0 -4.0  16.0

Here is the official documentation:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.interpolate.html

Here is a tutorial:

https://www.geeksforgeeks.org/python-pandas-dataframe-interpolate/

Note that interpolated values are only an approximation of the real values, which can lead to misinterpretation and wrong analysis of the plot.
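In the question's df the rows for the missing day are absent rather than NaN, so there is nothing to fill until those rows exist. The following is a minimal sketch of one way to get there, assuming that a missing (date, topic) pair means "no new events that day": reindex onto every date/topic combination, set count to 0, and carry count_cum forward per topic. This swaps interpolation for a forward fill, and the names full_index and filled are only illustrative.

```python
import pandas as pd

# Sample frame copied from the question.
df = pd.DataFrame({
    "date": ["2021-03-05", "2021-03-05", "2021-03-06", "2021-03-07", "2021-03-07"],
    "topic": ["topic_1", "topic_2", "topic_1", "topic_1", "topic_2"],
    "count_cum": [1, 1, 2, 3, 2],
    "count": [1, 1, 1, 1, 1],
})

# Build every (date, topic) combination so the missing rows show up as NaN.
full_index = pd.MultiIndex.from_product(
    [df["date"].unique(), df["topic"].unique()], names=["date", "topic"]
)
filled = df.set_index(["date", "topic"]).reindex(full_index).reset_index()

# Assumption: a missing row means no new events on that day, so count is 0.
filled["count"] = filled["count"].fillna(0).astype(int)

# Carry the last cumulative value forward within each topic;
# a topic that has not appeared yet starts at 0.
filled["count_cum"] = (
    filled.groupby("topic")["count_cum"].ffill().fillna(0).astype(int)
)

print(filled)
```

The resulting frame can then be passed to the call from the question, px.area(filled, x="date", y="count_cum", color="topic"). Because the cumulative value is carried forward rather than estimated, no approximated values are introduced.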

Guinther Kovalski