
I have a DataFrame with dtype=object that looks like this:

          YY    MM  DD  hh  var1    var2
.
.
.

10512   2013    01  01  06  1.64    4.64
10513   2013    01  01  07  1.57    4.63
10514   2013    01  01  08  1.56    4.71
10515   2013    01  01  09  1.45    4.69
10516   2013    01  01  10  1.53    4.67
10517   2013    01  01  11  1.31    4.63
10518   2013    01  01  12  1.41    4.70
10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  20  1.15    4.91
10521   2013    01  01  21  1.14    4.74
10522   2013    01  01  22  1.10    4.95

As seen, there are rows missing for some hours (hh): for instance, between rows 10519 and 10520, hh jumps from 13 to 20. I tried to fill the gaps by setting hh as the index, as discussed here: Missing data, insert rows in Pandas and fill with NAN

df = df.set_index('hh')
new_index = pd.Index(np.arange(0, 24), name="hh")
df = df.reindex(new_index).reset_index()

hoping to reach something like:

          YY    MM  DD  hh  var1    var2

10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  14  NaN     NaN
10521   2013    01  01  15  NaN     NaN
10522   2013    01  01  16  NaN     NaN
...
10523   2013    01  01  20  1.15    4.91
10524   2013    01  01  21  1.14    4.74
10525   2013    01  01  22  1.10    4.95

But I encounter the error "cannot reindex from a duplicate axis" at the line df = df.reindex(new_index). There are duplicate values for each hh = 0, 1, ..., 23, because the same hh value is repeated across different months (MM) and years (YY), which is probably the reason. How can I solve the problem?

In general, how can one fill the missing rows of a pandas DataFrame when the index contains duplicate values? I appreciate any comments.

  • Where exactly is the missing data? I see no NaN for hh between rows 10519 and 10520. – Khalil Al Hooti Sep 18 '18 at 03:30
  • @KhalilAlHooti The data has an hourly resolution, but hh jumps from 13 to 20. I want to insert the missing rows (14, 15, 16, ...) with NaN values for var1 and var2 so that I can fill the NaNs afterwards with simple techniques such as interpolation. – Pierre Hoshyar Sep 18 '18 at 03:39

2 Answers


First, create a new column of type datetime that holds the full time, date and hour included. One way to do this is as follows:

df = df.rename(columns={'YY': 'year', 'MM': 'month', 'DD': 'day', 'hh': 'hour'})
df['time'] = pd.to_datetime(df[['year', 'month', 'day', 'hour']])

To use to_datetime in this way, the column names need to be year, month, day and hour, which is why rename is applied first.

To get the expected result, set this new column as the index and use resample:

df.set_index('time').resample('H').mean()
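
As the comment below notes, var1 and var2 are still of dtype object in the original data, so they need to be cast to a numeric type for mean to work. A minimal end-to-end sketch under that assumption (selecting just the value columns, then interpolating the gaps afterwards, since that is the stated goal) could be:

df[['var1', 'var2']] = df[['var1', 'var2']].astype(float)

hourly = (df.set_index('time')[['var1', 'var2']]
            .resample('H').mean())   # missing hours appear as NaN rows

hourly = hourly.interpolate()        # e.g. linear interpolation of the gaps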
Shaido
  • Thanks for your time. It solves the problem. The only thing is that the dtype needed to be numeric, not an object, which one can easily change. – Pierre Hoshyar Sep 18 '18 at 05:35
  • @PierreHoshyar: Happy to help. Yes, actually to use the `resample` method the columns themselves need to be numeric, but it can be solved by using `agg` instead of `mean`, see https://stackoverflow.com/questions/47613521/resample-pandas-dataframe-and-merge-strings-in-column – Shaido Sep 19 '18 at 01:42

This code does what you need: it detects the gaps in hh and splices in rows of NaN values for the missing hours.

import pandas as pd
import numpy as np
from io import StringIO

YY, MM, DD, hh, var1, var2 = [],[],[],[],[],[]


a = '''10512   2013    01  01  06  1.64    4.64
10513   2013    01  01  07  1.57    4.63
10514   2013    01  01  08  1.56    4.71
10515   2013    01  01  09  1.45    4.69
10516   2013    01  01  10  1.53    4.67
10517   2013    01  01  11  1.31    4.63
10518   2013    01  01  12  1.41    4.70
10519   2013    01  01  13  1.49    4.80
10520   2013    01  01  20  1.15    4.91
10521   2013    01  01  21  1.14    4.74
10522   2013    01  01  22  1.10    4.95
10523   2013    01  01  27  1.30    4.55
10524   2013    01  01  28  1.2     4.62
'''

text = StringIO(a)

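# Parse each whitespace-separated row; field 0 is the original row label and is skipped.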
for line in text.readlines():
    a = line.split()
    YY.append(a[1])
    MM.append(a[2])
    DD.append(a[3])
    hh.append(a[4])
    var1.append(a[5])
    var2.append(a[6])

df = pd.DataFrame({'YY':YY, 'MM':MM, 'DD':DD,
                   'hh':hh, 'var1':var1, 'var2':var2})

df['hh'] = df.hh.astype(int)


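# Hour-to-hour differences; steps other than 1 mark the gaps.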
a = np.diff(df.hh)
b = np.where(a!=1)


df2 = df.copy(deep=True)

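# For each gap, build a block of NaN rows for the missing hours and splice it into df.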
for i in range(len(df)):

    if (i in b[0]):
        line = pd.DataFrame(columns=['YY', 'MM', 'DD',
                                     'hh', 'var1', 'var2'])
        for k in range(a[i]-1):

            line.loc[k]=[df2.iloc[i, 0], df2.iloc[i, 1],
                         df2.iloc[i, 2], df2.iloc[i, 3]+k+1 ,
                         np.nan, np.nan]

        df = pd.concat([df.loc[:i], 
                line, df.loc[i+1:]])


df.reset_index(inplace=True, drop=True)

print(df)

      YY  MM  DD  hh  var1  var2
0   2013  01  01   6  1.64  4.64
1   2013  01  01   7  1.57  4.63
2   2013  01  01   8  1.56  4.71
3   2013  01  01   9  1.45  4.69
4   2013  01  01  10  1.53  4.67
5   2013  01  01  11  1.31  4.63
6   2013  01  01  12  1.41  4.70
7   2013  01  01  13  1.49  4.80
8   2013  01  01  14   NaN   NaN
9   2013  01  01  15   NaN   NaN
10  2013  01  01  16   NaN   NaN
11  2013  01  01  17   NaN   NaN
12  2013  01  01  18   NaN   NaN
13  2013  01  01  19   NaN   NaN
14  2013  01  01  20  1.15  4.91
15  2013  01  01  21  1.14  4.74
16  2013  01  01  22  1.10  4.95
17  2013  01  01  23   NaN   NaN
18  2013  01  01  24   NaN   NaN
19  2013  01  01  25   NaN   NaN
20  2013  01  01  26   NaN   NaN
21  2013  01  01  27  1.30  4.55
22  2013  01  01  28   1.2  4.62
Khalil Al Hooti
  • Thank you for your time. This also works. Just out of curiosity, those for loops may be time-consuming when the DataFrame is huge, correct? – Pierre Hoshyar Sep 18 '18 at 22:29
  • Probably, yes. The inner for loop is the most time-consuming, especially if many samples are missing. There is another drawback with the code: if hour samples are missing across consecutive days, it will not work unless grouping per day is applied before the loops, which is simple to do (see the sketch after these comments). – Khalil Al Hooti Sep 18 '18 at 22:38
  • Yes, you mean when there are no hour samples for more than a whole day. I thought of that, but I think the data is not empty for more than a few hours at a time. Thanks for your time. – Pierre Hoshyar Sep 18 '18 at 22:41
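
Following up on the grouping idea in the comments above, here is a minimal sketch (not the answerer's code) of a per-day fill. It reuses the question's own approach of reindexing onto hours 0-23, but applies it per (YY, MM, DD) group so that gaps at day boundaries are handled as well; the column names are assumed to match the question:

filled = (df.groupby(['YY', 'MM', 'DD'], group_keys=True)
            .apply(lambda g: g.set_index('hh')[['var1', 'var2']]
                              .reindex(pd.Index(range(24), name='hh')))
            .reset_index())

Each day then has one row per hour, with NaN in var1 and var2 where data is missing, ready for interpolation.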