0

I have an issue similar to "ValueError: cannot reindex from a duplicate axis".The solution isn't provided.

I have an excel file containing multiple rows and columns of weather data. Data has missing at certain intervals although not shown in the sample below. I want to reindex the time column at 5 minute intervals so that I can interpolate the missing values. Data Sample:


Date        Time    Temp    Hum Dewpnt  WindSpd
04/01/18    12:05 a 30.6    49  18.7    2.7
04/01/18    12:10 a NaN     51  19.3    1.3
04/01/18    12:20 a 30.7   NaN  19.1    2.2
04/01/18    12:30 a 30.7    51  19.4    2.2 
04/01/18    12:40 a 30.9    51  19.6    0.9

Here's what I have tried.

import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
dt = pd.date_range("2018-04-01 00:00:00", "2018-05-01 00:00:00", freq='5min', name='T')
idx = pd.DatetimeIndex(dt)
ts.reindex(idx)

I just just want to have my index at 5 min frequency so that I can interpolate the NaN later. Expected output:

Date        Time    Temp    Hum Dewpnt  WindSpd
04/01/18    12:05 a 30.6    49  18.7    2.7
04/01/18    12:10 a NaN     51  19.3    1.3
04/01/18    12:15 a NaN   NaN  NaN     NaN
04/01/18    12:20 a 30.7   NaN  19.1    2.2
04/01/18    12:25 a NaN   NaN  NaN     NaN
04/01/18    12:30 a 30.7    51  19.4    2.2  

5 Answers5

2

One more approach.

df['Time'] = pd.to_datetime(df['Time'])
df = df.set_index(['Time']).resample('5min').last().reset_index()
df['Time'] = df['Time'].dt.time
df

output

       Time     Date        Temp    Hum     Dewpnt  WindSpd
0   00:05:00    4/1/2018    30.6    49.0    18.7    2.7
1   00:10:00    4/1/2018    NaN     51.0    19.3    1.3
2   00:15:00    NaN         NaN     NaN     NaN     NaN
3   00:20:00    4/1/2018    30.7    NaN     19.1    2.2
4   00:25:00    NaN         NaN     NaN     NaN     NaN
5   00:30:00    4/1/2018    30.7    51.0    19.4    2.2
6   00:35:00    NaN         NaN     NaN     NaN     NaN
7   00:40:00    4/1/2018    30.9    51.0    19.6    0.9

If times from multiple dates have to be re-sampled, you can use code below.

However, you will have to seperate 'Date' & 'Time' columns later.

df1['DateTime'] = df1['Date']+df1['Time']
df1['DateTime'] = pd.to_datetime(df1['DateTime'],format='%d/%m/%Y%I:%M %p')
df1 = df1.set_index(['DateTime']).resample('5min').last().reset_index()
df1

Output

DateTime    Date    Time    Temp    Hum     Dewpnt  WindSpd
0   2018-01-04 00:05:00     4/1/2018    12:05 AM    30.6    49.0    18.7    2.7
1   2018-01-04 00:10:00     4/1/2018    12:10 AM    NaN     51.0    19.3    1.3
2   2018-01-04 00:15:00     NaN     NaN     NaN     NaN     NaN     NaN
3   2018-01-04 00:20:00     4/1/2018    12:20 AM    30.7    NaN     19.1    2.2
4   2018-01-04 00:25:00     NaN     NaN     NaN     NaN     NaN     NaN
5   2018-01-04 00:30:00     4/1/2018    12:30 AM    30.7    51.0    19.4    2.2
6   2018-01-04 00:35:00     NaN     NaN     NaN     NaN     NaN     NaN
7   2018-01-04 00:40:00     4/1/2018    12:40 AM    30.9    51.0    19.6    0.9
moys
  • 7,747
  • 2
  • 11
  • 42
  • It is working but the output seems to have only the last 1 day data (last day April). How can it be modified to include the whole dataset? I think both the Date and time columns need to be combined. – user8277017 Feb 14 '20 at 04:57
  • I have got it to work with helpful suggestions from your answer. Added the working code. – user8277017 Feb 14 '20 at 05:38
0

You can try this for example:

import pandas as pd
ts = pd.read_excel('E:\DATA\AP.xlsx')
ts['Time'] = pd.to_datetime(ts['Time'])
ts.set_index('Time', inplace=True)
ts.resample('5T').mean()

More information here: https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.resample.html

LeZenith
  • 83
  • 7
0

Set the Time column as the index, making sure it is DateTime type, then try

ts.asfreq('5T')

use

ts.asfreq('5T', method='ffill')

to pull previous values forward.

Jamie
  • 146
  • 7
0

I would take the approach of creating a blank table and fill it in with the data as it comes from your data source. For this example three observations are read in as NaN, plus the row for 1:15 and 1:20 is missing.

import pandas as pd
import numpy as np
rawpd = pd.read_excel('raw.xlsx')
print(rawpd)
    Date      Time  Col1  Col2

0 2018-04-01 01:00:00 1.0 10.0
1 2018-04-01 01:05:00 2.0 NaN
2 2018-04-01 01:10:00 NaN 10.0
3 2018-04-01 01:20:00 NaN 10.0
4 2018-04-01 01:30:00 5.0 10.0

Now create a dataframe targpd with the ideal structure.

time5min = pd.date_range(start='2018/04/1 01:00',periods=7,freq='5min')
targpd = pd.DataFrame(np.nan,index = time5min,columns=['Col1','Col2'])
print(targpd)

                 Col1  Col2 

2018-04-01 01:00:00 NaN NaN
2018-04-01 01:05:00 NaN NaN
2018-04-01 01:10:00 NaN NaN
2018-04-01 01:15:00 NaN NaN
2018-04-01 01:20:00 NaN NaN
2018-04-01 01:25:00 NaN NaN
2018-04-01 01:30:00 NaN NaN

Now the trick is to update targpd with the data sent to you in rawpd. For this to happen the Date and Time columns have to be combined in rawpd and made into an index.

print(rawpd.Date,rawpd.Time)

0 2018-04-01
1 2018-04-01
2 2018-04-01
3 2018-04-01
4 2018-04-01

Name: Date, dtype: datetime64[ns]
0 01:00:00
1 01:05:00
2 01:10:00
3 01:20:00
4 01:30:00
Name: Time, dtype: object
You can see above the trick in all this. Your date data was converted to datetime but your time data is just a string. Below a proper index is created by used of a lambda function.

rawidx=rawpd.apply(lambda r : pd.datetime.combine(r['Date'],r['Time']),1)
print(rawidx)

This can be applied to the rawpd database as an index.

rawpd2=pd.DataFrame(rawpd[['Col1','Col2']].values,index=rawidx,columns=['Col1','Col2'])
rawpd2=rawpd2.sort_index()
print(rawpd2)

Once this is in place the update command can get you what you want.

targpd.update(rawpd2,overwrite=True)
print(targpd)

                 Col1  Col2

2018-04-01 01:00:00 1.0 10.0

2018-04-01 01:00:00 1.0 10.0

2018-04-01 01:05:00 2.0 NaN

2018-04-01 01:10:00 NaN 10.0

2018-04-01 01:15:00 NaN NaN

2018-04-01 01:20:00 NaN 10.0

2018-04-01 01:25:00 NaN NaN

2018-04-01 01:30:00 5.0 10.0

2018-04-01 01:05:00 2.0 NaN

2018-04-01 01:10:00 NaN 10.0

2018-04-01 01:15:00 NaN NaN

2018-04-01 01:20:00 NaN 10.0

2018-04-01 01:25:00 NaN NaN

2018-04-01 01:30:00 5.0 10.0

You now have a file ready for interpolation

  • Hi, I am getting this error when running rawpd2. KeyError: "None of [Index(['Col1', 'Col2'], dtype='object')] are in the [columns]" Everything is fine until then. – user8277017 Feb 14 '20 at 04:13
  • lambda functions can be tough to debug. I would look for a typo like 'col1' versus 'Col1' – Harold Henson Feb 14 '20 at 12:46
0

I have got it to work. thank you everyone for your time. I am providing the working code.

import pandas as pd
df = pd.read_excel('E:\DATA\AP.xlsx', sheet_name='Sheet1', parse_dates=[['Date', 'Time']])
df = df.set_index(['Date_Time']).resample('5min').last().reset_index()
print(df)