5

Ok, I have a big dataframe such as:

      hour    value
  0      0      1
  1      6      2
  2     12      3
  3     18      4
  4      0      5
  5      6      6
  6     12      7
  7     18      8
  8      6      9
  9     12     10
 10     18     11
 11     12     12
 12     18     13
 13      0     14

Let's don't get lost here. The column hour represents the hours of the day, from 6 to 6 hours. Column values is well, exactly that, here the values are as an example, not the actual ones.

If you look closely to the hour column, you can see that there are hours missing. For instance, there is a gap between rows 7 and 8 (the value of hour 0 is missing). There are also bigger gaps, such as in between rows 10 and 11 (hours 00 and 06).

What do I need? I would like to check when an hour (and of course) a value is missing, and complete the dataframe inserting a row there with the corresponding hour and a np.nan as value.

What have I thought? I think this would be easily solved using modular arithmetic, in this case with mod 24, such as when 18 + 6 = 24 = 0 mod 24. So initializing the counter to zero and adding 6 with the caveat that the counter is defined in modular arithmetic mod 24 you can verify if each hour is the corresponding hour, and if not, insert a new row with the corresponding hour and with np.nan as value.

I don't know how to do the implementation of modular arithmetic in python to iterate a dataframe column.

Thank you very much.

piRSquared
  • 285,575
  • 57
  • 475
  • 624
David
  • 1,155
  • 1
  • 13
  • 35
  • You could use `reindex` with an input such as `n * [0, 6, 12, 18]`. You would need to make `hour` the index first. – IanS May 19 '16 at 16:07
  • 1
    @IanS you need to identify which groups of rows constitute a day otherwise, you get `ValueError: cannot reindex from a duplicate axis` – piRSquared May 19 '16 at 16:29
  • And how do you know if you have a gap of four missing rows? – John Y May 19 '16 at 18:11

2 Answers2

6

Solution

group_hours = (df.hour <= df.hour.shift()).cumsum()

def insert_missing_hours(df):
    return df.set_index('hour').reindex([0, 6, 12, 18]).reset_index()

df.groupby(group_hours).apply(insert_missing_hours).reset_index(drop=1)

Looks like:

    hour  value
0      0    1.0
1      6    2.0
2     12    3.0
3     18    4.0
4      0    5.0
5      6    6.0
6     12    7.0
7     18    8.0
8      0    NaN
9      6    9.0
10    12   10.0
11    18   11.0
12     0    NaN
13     6    NaN
14    12   12.0
15    18   13.0
16     0   14.0
17     6    NaN
18    12    NaN
19    18    NaN

Explanation

In order to apply reindex I needed to determine which rows to group. I checked to see if row's hour was less or equal than prior row's hour. If so, that flags a new group.

insert_missing_hours is precisely the reindex of subgroups with [0, 6, 12, 18].

piRSquared
  • 285,575
  • 57
  • 475
  • 624
4

Although this solution is more complex, it should be much more efficient, especially for larger dataframes. This solution creates a list of hours and values (including any filled hours) and then uses this list to produce a dataframe.

def hour_checker(hours, values):
    def check_hour(hour):
        if hour not in (0, 6, 12, 18):
            raise ValueError('Invalid hour')
    [check_hour(hour) for hour in hours]
    result = []
    valid_hours = np.arange(0, 24, 6)
    while valid_hours[-1] != hour:
        # Initialize.
        valid_hours = np.roll(valid_hours, -1)
        result.append([hours.iat[0], values.iat[0]])
    for hour, value in zip(hours.iloc[1:], values.iloc[1:]):
        while hour != valid_hours[0]:
            result.append([valid_hours[0], None])
            valid_hours = np.roll(valid_hours, -1)
        result.append([hour, value])
        valid_hours = np.roll(valid_hours, -1)
    return pd.DataFrame(result, columns=['hour', 'value'])

hour_checker(df['hour'], df['value'])
Out[33]: 
    hour  value
0      0      1
1      6      2
2     12      3
3     18      4
4      0      5
5      6      6
6     12      7
7     18      8
8      0    NaN
9      6      9
10    12     10
11    18     11
12     0    NaN
13     6    NaN
14    12     12
15    18     13
16     0     14

Timings

df_test = pd.concat([df] * 100)

%%timeit
group_hours = (df_test.hour <= df_test.hour.shift()).cumsum()
df_test.groupby(group_hours).apply(insert_missing_hours).reset_index(drop=1)
1 loops, best of 3: 611 ms per loop

%timeit hour_checker(df_test['hour'], df_test['value'])
100 loops, best of 3: 12.4 ms per loop
Alexander
  • 105,104
  • 32
  • 201
  • 196