Iterating pandas dataframe, checking values and creating some of them

Question

Ok, I have a (big) dataframe, something like this:

         date       time      value
0     20100201         0         1
1     20100201         6         2
2     20100201        12         3
3     20100201        18         4
4     20100202         0         5
5     20100202         6         6
6     20100202        12         7
7     20100202        18         8
8     20100203         0         9
9     20100203        18        11
10    20100204         6        12
...
8845  20160101        18      8846

As you can see, the dataframe has a column date, a column time with four hours for each day (00, 06, 12, 18) and a column value.

The problem is that there are missing rows in the dataframe. In the example above there should be two extra rows between rows 8 and 9, corresponding to hours 6 and 12 of day 20100203, and one extra row between rows 9 and 10, corresponding to hour 0 of day 20100204.

What do I need? I would like to iterate over the date column of the dataframe, checking that every day exists and none is missing, and also that every day has the four hours (00, 06, 12, 18). If something is missing, a row should be added in exactly that place, with the missing date and time and NaN as the value. To avoid copying the whole dataframe again, here are the relevant rows as they should appear in the final version:

...
7     20100202        18         8
8     20100203         0         9
9     20100203         6       NaN
10    20100203        12       NaN   
11    20100203        18        11
12    20100204         0       NaN
13    20100204         6        12
...

In case you are interested, an easier version of this problem was asked in "Modular arithmetic in python to iterate a pandas dataframe" and kindly answered by users @Alexander and @piRSquared. The version asked here is more difficult, involving (I suppose) the use of datetime and timedelta and iterating over more columns.

Sorry for the long post and thank you very much.

score 1 · Accepted Answer · edited Sep 26 '17 at 14:42

You can use `pivot` for reshaping: it puts NaN wherever a `time` is missing for a `date`. Then use `unstack` with `reset_index` and `sort_values`:

import pandas as pd

df = pd.DataFrame({'date': {0: 20100201, 1: 20100201, 2: 20100201, 3: 20100201, 4: 20100202, 5: 20100202, 6: 20100202, 7: 20100202, 8: 20100203, 9: 20100203, 10: 20100204}, 
                   'time': {0: 0, 1: 6, 2: 12, 3: 18, 4: 0, 5: 6, 6: 12, 7: 18, 8: 0, 9: 18, 10: 6},
                   'value': {0: 1, 1: 2, 2: 3, 3: 4, 4: 5, 5: 6, 6: 7, 7: 8, 8: 9, 9: 11, 10: 12}})

print (df)
        date  time  value
0   20100201     0      1
1   20100201     6      2
2   20100201    12      3
3   20100201    18      4
4   20100202     0      5
5   20100202     6      6
6   20100202    12      7
7   20100202    18      8
8   20100203     0      9
9   20100203    18     11
10  20100204     6     12

print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date'))

    time      date  value
0      0  20100201    1.0
4      6  20100201    2.0
8     12  20100201    3.0
12    18  20100201    4.0
1      0  20100202    5.0
5      6  20100202    6.0
9     12  20100202    7.0
13    18  20100202    8.0
2      0  20100203    9.0
6      6  20100203    NaN
10    12  20100203    NaN
14    18  20100203   11.0
3      0  20100204    NaN
7      6  20100204   12.0
11    12  20100204    NaN
15    18  20100204    NaN

You can call `reset_index` again with `drop=True` if you need a clean sequential index:

print (df.pivot(index='date', columns='time', values='value')
         .unstack()
         .reset_index(name='value')
         .sort_values('date')
         .reset_index(drop=True))

    time      date  value
0      0  20100201    1.0
1      6  20100201    2.0
2     12  20100201    3.0
3     18  20100201    4.0
4      0  20100202    5.0
5      6  20100202    6.0
6     12  20100202    7.0
7     18  20100202    8.0
8      0  20100203    9.0
9      6  20100203    NaN
10    12  20100203    NaN
11    18  20100203   11.0
12     0  20100204    NaN
13     6  20100204   12.0
14    12  20100204    NaN
15    18  20100204    NaN
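
One caveat: `pivot` only fills in hours for dates that appear at least once in the data, so a day that is absent entirely stays absent. The sketch below is one possible way to cover that case as well, and it is not part of the original answer. It assumes the `date` column holds integers in `YYYYMMDD` form, that the expected hours are exactly 0, 6, 12 and 18, and that each (`date`, `time`) pair occurs at most once. The sample data and the names `all_days`, `full_index` and `filled` are made up for illustration.

```python
import pandas as pd

# Small sample with gaps: 20100203 lacks hours 6 and 12,
# 20100204 is absent entirely, 20100205 has only hour 6.
df = pd.DataFrame({
    'date':  [20100202, 20100202, 20100202, 20100202, 20100203, 20100203, 20100205],
    'time':  [0, 6, 12, 18, 0, 18, 6],
    'value': [5, 6, 7, 8, 9, 11, 12],
})

# Parse the integer dates so that a continuous daily range can be generated.
days = pd.to_datetime(df['date'].astype(str), format='%Y%m%d')
all_days = pd.date_range(days.min(), days.max(), freq='D')

# Every (day, hour) pair that should exist.
full_index = pd.MultiIndex.from_product([all_days, [0, 6, 12, 18]],
                                        names=['date', 'time'])

# reindex adds NaN rows for pairs that are absent;
# it needs unique (date, time) pairs to work.
filled = (df.assign(date=days)
            .set_index(['date', 'time'])
            .reindex(full_index)
            .reset_index())

# Restore the original integer date format.
filled['date'] = filled['date'].dt.strftime('%Y%m%d').astype(int)
print(filled)
```

Because `full_index` is built in date-then-hour order, the result already comes out sorted with a sequential index, so no extra `sort_values` call should be needed here.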

It doesn't seem to work, it raises `ValueError: Index contains duplicate entries, cannot reshape`... — David, May 25 '16 at 12:35
That means you have duplicates: for some `date` and `time` pairs you have multiple values. For example, one row is `0 20100201 0 1` and a second row is `0 20100201 0 5`. Is that correct? — jezrael, May 25 '16 at 12:43
Ok, in that case it should be possible to use `drop_duplicates` or something related to eliminate the duplicates, right? But I am afraid that if I eliminate the duplicates, the index is not going to be 0, 1, 2... but something like 0, 1, 3... and I need an ordered index to use your method... — David, May 25 '16 at 12:55
Hmm, there is one method: aggregate the duplicates. Give me some time, I will add a solution. — jezrael, May 25 '16 at 12:57
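
The follow-up solution is not reproduced in this thread, so the following is only a sketch of two common ways to make the (`date`, `time`) pairs unique before calling `pivot`. The sample data are invented, and using the mean as the aggregate is an assumption. Note that `pivot` works from the `date` and `time` columns, not from the row labels, so gaps in the index left by `drop_duplicates` should not matter.

```python
import pandas as pd

# Sample with a repeated (date, time) pair: 20100201 at hour 0 appears twice.
df = pd.DataFrame({
    'date':  [20100201, 20100201, 20100201, 20100202],
    'time':  [0, 0, 6, 0],
    'value': [1, 5, 2, 2],
})

# Option A: keep the first row for each (date, time) pair.
# Only 'date' and 'time' are compared, so repeated numbers in 'value' are untouched.
first_only = df.drop_duplicates(subset=['date', 'time'])

# Option B: aggregate the duplicates instead
# (mean is an assumption; choose whatever suits the data).
averaged = df.groupby(['date', 'time'], as_index=False)['value'].mean()

# Either result has unique (date, time) pairs, so pivot no longer raises.
result = (averaged.pivot(index='date', columns='time', values='value')
                  .unstack()
                  .reset_index(name='value')
                  .sort_values(['date', 'time'])
                  .reset_index(drop=True))
print(result)
```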

score 0 · Answer 2 · answered May 25 '16 at 13:35

Ok, thank you, it is almost done, but something is missing: I need the dataframe to be ordered. That is, for each day, beginning with 20100201, the first row should be the 00 hour, the second 06, the third 12, the fourth 18, then 20100202 beginning with the 00 hour, and so on until the final date in 2016. This order is important for doing some statistics with the data. Let me show you what I get:

      time      date  value
   0     0  20100201  281.0
2224     6  20100201  278.0
4448    12  20100201  285.4
6672    18  20100201  287.6
2225     6  20100202  280.6
4449    12  20100202  287.2
6673    18  20100202  287.8
   1     0  20100202  282.4
   2     0  20100203  281.6
6674    18  20100203  287.8
4450    12  20100203  285.1
2226     6  20100203  281.0
6675    18  20100204  289.4
4451    12  20100204  286.8
   3     0  20100204  284.6
2227     6  20100204  284.2
...

(By the way, in the highly probable case of repetition in the value column, I suppose there is no problem, right? The solution is designed to eliminate the duplicates simultaneously in the other two columns, right?)

I don't know, but it seems that using `sort_values` as in my solution works? My sample data come out ordered by `date` and `time`. Maybe with real data you need `.sort_values(['date', 'time'])`? — jezrael, May 25 '16 at 13:44
Yes, with `.sort_values(['date', 'time'])` I get the data ordered. Thank you, thank you very much for your patience and support. You have been an invaluable help. I wish I knew as much as you. — David, May 25 '16 at 13:53
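
To illustrate the fix agreed on above: sorting by `date` alone leaves the order of the hours within a day unspecified, since the default sort algorithm is not documented as stable, whereas sorting by both columns pins it down. A minimal sketch, using a few of the rows shown in the output above as sample data (the name `ordered` is made up):

```python
import pandas as pd

# Rows in the shuffled order reported above for 20100202 and 20100203.
df = pd.DataFrame({
    'time':  [6, 12, 18, 0, 0, 18, 12, 6],
    'date':  [20100202, 20100202, 20100202, 20100202,
              20100203, 20100203, 20100203, 20100203],
    'value': [280.6, 287.2, 287.8, 282.4, 281.6, 287.8, 285.1, 281.0],
})

# Sort by day first, then by hour within each day, and renumber the rows.
ordered = (df.sort_values(['date', 'time'])
             .reset_index(drop=True)
             [['date', 'time', 'value']])
print(ordered)
```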
