1

I have this data on a csv file:

  Date/Time      kWh  kVArh        kVA      PF
0    2021-01-01 00:30:00   471.84   0.00   943.6800  1.0000
1    2021-01-01 01:00:00   491.04   1.44   982.0842  1.0000
2    2021-01-01 01:30:00   475.20   0.00   950.4000  1.0000
3    2021-01-01 02:00:00   470.88   0.00   941.7600  1.0000
4    2021-01-01 02:30:00   466.56   0.00   933.1200  1.0000
...                  ...      ...    ...        ...     ...
9223 2021-07-14 04:00:00  1104.00  53.28  2210.5698  0.9988
9224 2021-07-14 04:30:00  1156.30  49.92  2314.7542  0.9991
9225 2021-07-14 05:00:00  1176.00  37.92  2353.2224  0.9995
9226 2021-07-14 05:30:00  1177.00  27.36  2354.6359  0.9997
9227 2021-07-14 06:00:00  1196.60  22.56  2393.6253  0.9998

And I use this code to read it and later export it to a csv file, after I calculate the average for every hour.

import pandas as pd
file = pd.read_csv('Electricity_data.csv', 
                    sep = ',', 
                    skiprows = 0,
                    dayfirst = True,
                    parse_dates = ['Date/Time'])

pd_mean = file.groupby(pd.Grouper(key = 'Date/Time', freq = 'H')).mean().reset_index()

pd_mean.to_csv("data_1h_year_.csv")

However, when I run it, my final file has a gap.

Data before the code launches (Date: 03/01/2021):

Date/Time     kWh  kVArh        kVA   PF
90  2021-02-01 21:30:00  496.83   0.00   993.6600  1.0
91  2021-02-01 22:00:00  486.72   0.00   973.4400  1.0
92  2021-02-01 22:30:00  490.08   0.00   980.1600  1.0
93  2021-02-01 23:00:00  503.00   1.92  1006.0073  1.0
94  2021-02-01 23:30:00  484.84   0.00   969.6800  1.0
95  2021-03-01 00:00:00  484.80   0.00   969.6000  1.0
96  2021-03-01 00:30:00  487.68   0.00   975.3600  1.0
97  2021-03-01 01:00:00  508.30   1.44  1016.6041  1.0
98  2021-03-01 01:30:00  488.66   0.00   977.3200  1.0
99  2021-03-01 02:00:00  486.24   0.00   972.4800  1.0
100 2021-03-01 02:30:00  495.36   1.44   990.7242  1.0
101 2021-03-01 03:00:00  484.32   0.00   968.6400  1.0
102 2021-03-01 03:30:00  485.76   0.00   971.5200  1.0
103 2021-03-01 04:00:00  492.48   1.44   984.9642  1.0
104 2021-03-01 04:30:00  476.16   0.00   952.3200  1.0
105 2021-03-01 05:00:00  477.12   0.00   954.2400  1.0

Data after the code launches (Date: 03/01/2021):

Date/Time       kWh   kVArh         kVA       PF
45 2021-01-02 21:00:00  1658.650  292.32  3368.45000  0.98485
46 2021-01-02 22:00:00  1622.150  291.60  3296.34415  0.98420
47 2021-01-02 23:00:00  1619.300  261.36  3280.52380  0.98720
48 2021-01-03 00:00:00       NaN     NaN         NaN      NaN
49 2021-01-03 01:00:00       NaN     NaN         NaN      NaN
50 2021-01-03 02:00:00       NaN     NaN         NaN      NaN
51 2021-01-03 03:00:00       NaN     NaN         NaN      NaN
52 2021-01-03 04:00:00       NaN     NaN         NaN      NaN
53 2021-01-03 05:00:00       NaN     NaN         NaN      NaN
54 2021-01-03 06:00:00  1202.400  158.40  2425.57730  0.99140
55 2021-01-03 07:00:00  1209.375  168.00  2441.98105  0.99050
56 2021-01-03 08:00:00  1260.950  162.72  2542.89820  0.99175
57 2021-01-03 09:00:00  1308.975  195.60  2647.07935  0.98900
58 2021-01-03 10:00:00  1334.150  193.20  2696.17005  0.98965

I do not know why this is happening, but it didn't calculate the mean values and I got the NaN forming gaps around the final csv file.

Sultry T.
  • 69
  • 10
  • 1
    Please read this: https://stackoverflow.com/q/20109391/8947333 and update your question: write code so we can create a minimal dataframe – Be Chiller Too Oct 20 '21 at 12:21
  • you need to check the calculations you are performing - some give an error, maybe a overflow, division by zero, non-valid input; second thing - check before writing again to file, if you use pandas dataframe you can replace NaNs with some other value before writing – Petronella Oct 20 '21 at 12:21
  • It's because `pd.Grouper(key = 'Date/Time', freq = 'H')` fill missing hours – Corralien Oct 20 '21 at 12:21
  • @Corralien what do you men with the fill missing hour? My data does not have any empty spaces. – Sultry T. Oct 20 '21 at 12:29
  • Can you try this code, please: `df.groupby(df['Date/Time'].dt.round('H')).mean()` – Corralien Oct 20 '21 at 12:32
  • 1
    I don't understand how for 2021-02-01 22:00:00, the average energy (kWh) could be 1622.150 because the mean between 486.72 (22:00) and 490.08 (22:30) is 488.4? – Corralien Oct 20 '21 at 12:37
  • @Corralien That solves the NaN problem !! But it doesn't calculates the mean value. Ex: `2021-03-01 00:00:00 484.80 0.00 969.6000 1.0 2021-03-01 00:30:00 487.68 0.00 975.3600 1.0` returns `2021-01-03 00:00:00 1642.100000 269.28 3328.065000 0.986800` but those are not the mean values ! – Sultry T. Oct 20 '21 at 12:39
  • @Corralien, indeed that was the second problem. Understand why my code is calculating this. I use the same code on another complex database and it works correctly. – Sultry T. Oct 20 '21 at 12:41
  • 1
    @SultryT. I think there is a matter with the `dayfirst = True` argument. The `2021-07-14` doesn't seems to be a dayfirst one. – ndclt Oct 20 '21 at 12:48

1 Answers1

1

Pandas does not interpret correctly your dates. Specify the format yourself.

Use the code below to solve your problem:

parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')

df = pd.read_csv('data.csv', sep=',', skiprows=0,
                 parse_dates=['Date/Time'], date_parser=parser)
pd_mean = df.groupby(pd.Grouper(key='Date/Time', freq='H')).mean()

Check your dates before operation:

93  2021-02-01 23:00:00  # February, 1st
94  2021-02-01 23:30:00  # February, 1st
95  2021-03-01 00:00:00  # March, 1st
96  2021-03-01 00:30:00  # March, 1st
Corralien
  • 109,409
  • 8
  • 28
  • 52
  • 1
    @SultryT. Does it solve your problem? – Corralien Oct 20 '21 at 12:54
  • I understand the logic, I get this error: `ValueError: time data '1/1/2021 0:30' does not match format '%Y-%d-%m %H:%M:%S' (match)` – Sultry T. Oct 20 '21 at 12:59
  • If I take off the `dayfirst = True` that has a near value, but it doesn't calculate the average value. – Sultry T. Oct 20 '21 at 13:14
  • Try to change the format to`'%d/%m/%Y %H:%M:%S'`. Tes dates sont au format français? `dayfirst` in this case is useless. – Corralien Oct 20 '21 at 13:21
  • Oui, efectivement ! I think they are in french format. How I can have the same format without the same **error**? `ValueError: time data '1/1/2021 0:30' does not match format '%d/%m/%Y %H:%M:%S' (match)` – Sultry T. Oct 20 '21 at 13:30
  • Ok, can you change the answer, so I can accepted as correct, please? I used the code like this: `parser = lambda x: pd.to_datetime(x, format='%m/%d/%Y %H:%M')` `df = pd.read_csv('data.csv', sep = ',', skiprows = 0, parse_dates = ['Date/Time'], date_parser=parser)` `pd_mean = df.groupby(pd.Grouper(key = 'Date/Time', freq = 'H')).mean()` Merci bcp Corralien ! – Sultry T. Oct 20 '21 at 13:44
  • I updated my answer. Merci pour ton vote :) – Corralien Oct 20 '21 at 14:40