2

I am struggling to average my data into monthly mean taking into account that the data can represent two months (and years), and there are empty periods. Thus a need to weigh the samples accordantly, i.e. a sample with 2 days in one month and 5 in the next should have 2/7 and 5/7 weight, respectively. I have tried to first reindex to daily data and then resample, but problems with empty periods (do not want to fill these) and irregular sampling periods. Surly I can calculate the duration for each periods (end-start), but not sure if that helps because also need to know the split between months.

Here is part of the data:

start               end                 conc
26.12.2019 07:00    03.01.2020 07:00    0.235
03.01.2020 07:00    14.01.2020 07:00    nan
14.01.2020 07:00    20.01.2020 07:00    0.032
21.01.2020 07:00    29.01.2020 07:00    1.601
29.01.2020 07:00    06.02.2020 07:00    0.815
06.02.2020 07:00    10.02.2020 07:00    nan
10.02.2020 07:00    20.02.2020 07:00    3.029
20.02.2020 07:00    28.02.2020 07:00    2.158
28.02.2020 07:00    09.03.2020 07:00    0.784
09.03.2020 07:00    18.03.2020 07:00    0.304

The results should be:

jan.2020    0.829
feb.2020    2.128
Wenche Aas
  • 21
  • 4

1 Answers1

0

From a software design prospective, it looks like you need to do some data cleaning before trying to interact with the data.

Data Science Cycle

You need to define what you want to do. Do you want to get rid of all the nan values, or treat them as 0? What if an event starts in one month but ends in another? Once you have these questions, you can start thinking about what to code. Below would be the approach I would take (But I don't know exactly what you want to do with the data):

  1. Deal with the NaNs
  2. decide what month each data point is in, and add a column with that information (heres an example of how to get the month from python datatimes).
  3. use start and end date columns to create a new "duration" column
start               end                Duration    month   conc
26.12.2019 07:00    03.01.2020 07:00     x sec      Dec    0.235
03.01.2020 07:00    14.01.2020 07:00     y sec      Jan    nan
14.01.2020 07:00    20.01.2020 07:00     z sec      Jan    0.032
  1. Aggrigate the data by month. Add up all the durations in a month. Add up all the conc data in a month. avg = tot_conc / tot_duration. Then add this value to a new table:
 | month | avg_conc |
 |  Feb. |  2.218.  |
  • Thanks for commenting! I know what I want to do, I want to calculate a monthly average excluding the NaN. I guess my question is how to calculate the duration in each months when the event cover two different months. No problem to calculate the average if the samples are in same month. – Wenche Aas Apr 21 '21 at 07:31
  • That's a hard question, and It depends on what you are measuring. imagine some data that along the lines of: ``` start | end | val jan 25 | feb 2 | 6 ``` It might be reasonable to spilt this into 2 data points: ``` start | end | val jan 25 | jan 31 | 6 feb 1 | feb 2 | 6 ``` or: ``` start | end | val jan 25 | jan 31 | 5 feb 1 | feb 2 | 1 ``` sorry about the formating, Comments dont allow new lines. – Liam Warfield Apr 22 '21 at 20:07