I have a whole year of data sampled at fifteen-minute intervals and want to create a histogram that counts hours rather than fifteen-minute samples.

Toy example code

I have the following toy example code:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv(r"D:/tmp/load.csv")
df.hist(bins=range(20,80,5))
sns.set()
plt.xlabel("Value")
plt.ylabel("count")
plt.show()

Which produces the following graph.

[image: histogram]

The data within the DataFrame is of the form:

>>> df[(df["Time"] > "2021-04-10 19:45:00") & (df["Time"] < "2021-04-10 21:00:00")]
                     Time      tag
9584  2021-04-10 20:00:00  50.3840
9585  2021-04-10 20:15:00  37.8332
9586  2021-04-10 20:30:00  36.6808
9587  2021-04-10 20:45:00  37.1840

Expected result

I need to change the y-axis values of the histogram so that they show a count of hours rather than a count of fifteen-minute samples. For the first column, for example, I should see 10 (40/4) instead of 40. In other words, the whole y-axis should be divided by 4.

Question

How can I scale the y-axis of the histogram? Should I use the plt.yticks function somehow?

Cedric Zoppolo
  • Can you provide sample input file? or may be this can help https://stackoverflow.com/a/30326636/8884381 – Amit Gupta Jun 06 '22 at 15:51
  • @AmitGupta I added example of data. The answer you suggest doesn't do what I need. That answer makes a zoom and I need to rescale the y-axis. – Cedric Zoppolo Jun 06 '22 at 16:08

1 Answer

Here is my take on your interesting question.

I don't know of a way to rescale the y-axis after having plotted the dataframe, but you can rescale the dataframe itself.

For instance, in the following toy dataframe, where measurements are 15 minutes apart, 9 values fall between 35 and 40:

  • 4 values have been measured between 20:00:00 and 20:59:00
  • 1 between 21:00:00 and 21:59:00
  • 3 between 22:00:00 and 22:59:00
  • 1 between 23:00:00 and 23:59:00

import pandas as pd

df = pd.DataFrame(
    {
        "index": [
            "2021-04-10 20:00:00",
            "2021-04-10 20:15:00",
            "2021-04-10 20:30:00",
            "2021-04-10 20:45:00",
            "2021-04-10 21:00:00",
            "2021-04-10 21:15:00",
            "2021-04-10 21:30:00",
            "2021-04-10 21:45:00",
            "2021-04-10 22:00:00",
            "2021-04-11 22:15:00",
            "2021-04-11 22:30:00",
            "2021-04-11 22:45:00",
            "2021-04-11 23:00:00",
            "2021-04-11 23:15:00",
            "2021-04-11 23:30:00",
            "2021-04-11 23:45:00",
        ],
        "tag": [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28],
    },
)
df["index"] = pd.to_datetime(df["index"], format="%Y-%m-%d %H:%M:%S")
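
The per-hour counts listed above can be checked directly. This is a minimal sketch, assuming the same toy dataframe (rebuilt here with `pd.date_range` so the snippet is self-contained) and a left-closed 35-40 bin; the names `stamps`, `in_bin` and `per_hour` are only illustrative:

```python
import pandas as pd

# Rebuild the toy dataframe (same timestamps and values as above).
stamps = list(pd.date_range("2021-04-10 20:00", periods=9, freq="15min")) + list(
    pd.date_range("2021-04-11 22:15", periods=7, freq="15min")
)
df = pd.DataFrame(
    {
        "index": stamps,
        "tag": [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28],
    }
)

# Keep only the samples that fall in the 35-40 bin (35 included, 40 excluded).
in_bin = df[(df["tag"] >= 35) & (df["tag"] < 40)]

# Count the 15-minute samples per hour of the day.
per_hour = in_bin.groupby(in_bin["index"].dt.hour).size()
print(per_hour)
print("total:", len(in_bin))
```

With the toy data this should report 4, 1, 3 and 1 samples for hours 20 to 23, for a total of 9.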

Here is the corresponding plot:

df.copy().set_index("index").plot(
    kind="hist", bins=range(20, 80, 5), yticks=range(0, 10), grid=True
)

[image: histogram of the toy dataframe]

Had the measurements been hourly, only 4 values would have fallen in the 35-40 bin:

  • 1 (and not 4) between 20:00:00 and 20:59:00
  • 1 between 21:00:00 and 21:59:00
  • 1 (and not 3) between 22:00:00 and 22:59:00
  • 1 between 23:00:00 and 23:59:00
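
These hourly figures can be checked by counting, for each bin, the distinct hours in which at least one sample fell. This is a minimal sketch, assuming the same toy dataframe (rebuilt here so the snippet runs on its own) and left-closed bins (`right=False` in `pd.cut`) to mirror the histogram; `hour_key` and `hours_per_bin` are illustrative names:

```python
import pandas as pd

# Rebuild the toy dataframe (same timestamps and values as above).
stamps = list(pd.date_range("2021-04-10 20:00", periods=9, freq="15min")) + list(
    pd.date_range("2021-04-11 22:15", periods=7, freq="15min")
)
df = pd.DataFrame(
    {
        "index": stamps,
        "tag": [39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28],
    }
)

hours_per_bin = (
    df.assign(
        # Left-closed bins [20, 25), [25, 30), ... like the histogram's.
        bin=pd.cut(df["tag"], bins=list(range(20, 80, 5)), right=False),
        # One key per calendar hour, e.g. "2021-04-10 20".
        hour_key=df["index"].dt.strftime("%Y-%m-%d %H"),
    )
    # observed=True keeps only the bins that actually contain samples.
    .groupby("bin", observed=True)["hour_key"]
    .nunique()
)
print(hours_per_bin)
```

For the 35-40 bin this should give 4, matching the list above.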

So, rescaling the dataframe to an hourly basis involves the following steps:

  • assign new columns for bins, dates and hours
  • sort values and drop rows with the same bin, date and hour, keeping only the first duplicate row
  • clean up and plot

bins = range(20, 80, 5)  # same edges for binning and for plotting

_ = (
    df.assign(
        # right=False makes the bins left-closed, like the histogram's
        bin=pd.cut(df["tag"], bins=bins, right=False),
        date=df["index"].dt.date,
        hour=df["index"].dt.hour,
    )
    .sort_values(by=["bin", "date", "hour"])
    .drop_duplicates(subset=["bin", "date", "hour"], keep="first")
    .drop(columns=["bin", "date", "hour"])
    .set_index("index")
    .plot(kind="hist", bins=bins, yticks=range(0, 5), grid=True)
)

Which outputs:

[image: histogram of the hourly-rescaled dataframe]
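
For comparison, the literal division by 4 described in the question is a different technique and gives different numbers: every 15-minute sample counts as a quarter of an hour, so the 35-40 bin holds 9 * 0.25 = 2.25 hours rather than 4 distinct hours. This is a minimal sketch, assuming NumPy is available and reusing the toy values; `weights` is an existing parameter of `numpy.histogram`, while `values`, `edges` and `hours` are just illustrative names:

```python
import numpy as np

# The "tag" values of the toy dataframe, one per 15-minute sample.
values = np.array([39, 36, 36, 37, 42, 28, 39, 54, 43, 38, 39, 36, 44, 27, 38, 28])
edges = np.arange(20, 80, 5)  # same bin edges as in the plots

# Each 15-minute sample contributes a quarter of an hour to its bin.
weights = np.full(len(values), 0.25)
hours, _ = np.histogram(values, bins=edges, weights=weights)

for left, h in zip(edges[:-1], hours):
    print(f"[{left}, {left + 5}): {h:.2f} h")
```

The same `weights` array can also be passed to matplotlib's `hist` to draw the bars at a quarter of their height.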

Laurent