I have a DataFrame with dtype=object
as:
YY MM DD hh var1 var2
.
.
.
10512 2013 01 01 06 1.64 4.64
10513 2013 01 01 07 1.57 4.63
10514 2013 01 01 08 1.56 4.71
10515 2013 01 01 09 1.45 4.69
10516 2013 01 01 10 1.53 4.67
10517 2013 01 01 11 1.31 4.63
10518 2013 01 01 12 1.41 4.70
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 20 1.15 4.91
10521 2013 01 01 21 1.14 4.74
10522 2013 01 01 22 1.10 4.95
As seen, there are missing rows corresponding to hours (hh
) (for instance between 10519 and 10520 rows, hh
jumps from 13 to 20). I tried to add the gap by setting hh
as index, as what was discussed here: Missing data, insert rows in Pandas and fill with NAN
df=df.set_index('hh')
new_index = pd.Index(np.arange(0,24), name="hh")
df=df.reindex(new_index).reset_index()
and reach something like:
YY MM DD hh var1 var2
10519 2013 01 01 13 1.49 4.80
10520 2013 01 01 14 Nan Nan
10521 2013 01 01 15 Nan Nan
10522 2013 01 01 16 Nan Nan
...
10523 2013 01 01 20 1.15 4.91
10524 2013 01 01 21 1.14 4.74
10525 2013 01 01 22 1.10 4.95
But I encounter the error "cannot reindex from a duplicate axis"
for the part df=df.reindex(new_index)
.
There are duplicate values for each hh=0,1,...,23
, because same value of hh
would be repeated for different months (MM
) and years (YY
).
Probably that's the reason. How can I solve the problem?
In general,how can one fills the missing rows of pandas DataFrame when index contains duplicate data. I appreciate any comments.