
I'm trying to learn from IoT time-series data. The data comes from two different sources. In some measurements the difference between the sources is small: one source has 11 rows and the other has 15. In other measurements one source has 30 rows and the other has 240.

I thought to upsample by interpolating:

 df.resample('20ms').interpolate()

but saw that it deletes some rows. Is there a way to interpolate without deleting rows, or should I delete rows instead?
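One way to keep every original row when moving to a coarser grid (a sketch, not from the post; the 100 ms index, the 60 ms target grid, and the column name `a` are made-up stand-ins) is to reindex onto the union of the original index and the target grid, then interpolate:

```python
import pandas as pd

# Hypothetical 100 ms series standing in for the sparser source.
idx = pd.date_range('2011-01-01', periods=5, freq='100ms')
df = pd.DataFrame({'a': [100, 200, 300, 400, 500]}, index=idx)

# Build the coarser 60 ms grid, then take the union with the original
# index so that no original timestamp is dropped.
target = pd.date_range(df.index[0], df.index[-1], freq='60ms')
combined = df.reindex(df.index.union(target)).interpolate(method='time')

print(combined.loc[idx])     # the five original rows, unchanged
print(combined.loc[target])  # the 60 ms grid, interpolated in time
```

`resample('60ms')` alone drops the original 100 ms timestamps that do not land on the 60 ms grid; the union-reindex keeps both sets of timestamps in one frame.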

EDIT - data and code:

#!/usr/bin/env python3.6
import pandas as pd

first_df_file_name = 'interpolate_test.in'
df = pd.read_csv(first_df_file_name, header=0, delimiter=' ')
print(df.head(5))

# Attach a synthetic 100 ms datetime index so resample() can be used.
new_col = pd.date_range('1/1/2011 00:00:00.000000', periods=len(df.index), freq='100ms')
df.insert(loc=0, column='date', value=new_col)
df.set_index('date', inplace=True)

upsampled = df.resample('20ms').interpolate()
print('20 ms, num rows', len(upsampled.index))
print(upsampled.head(5))
upsampled.to_csv('test_20ms.out')

upsampled = df.resample('60ms').interpolate()
print('60 ms, num rows', len(upsampled.index))
print(upsampled.head(5))
upsampled.to_csv('test_60ms.out')

This is the test input file (`interpolate_test.in`):

a b
100 200
200 400
300 600
400 800
500 1000
600 1100
700 1200
800 1300
900 1400
1000 2000

Here is the output (parts of it):

 //output when interpolating by 20 ms - this is fine
                         a      b
 date                                 
 2011-01-01 00:00:00.000  100.0  200.0
 2011-01-01 00:00:00.020  120.0  240.0
 2011-01-01 00:00:00.040  140.0  280.0
 2011-01-01 00:00:00.060  160.0  320.0
 2011-01-01 00:00:00.080  180.0  360.0

 //output when interpolating by 60 ms - data is lost
 60 ms, num rows 16
                         a      b
 date                                 
 2011-01-01 00:00:00.000  100.0  200.0
 2011-01-01 00:00:00.060  160.0  320.0
 2011-01-01 00:00:00.120  220.0  440.0
 2011-01-01 00:00:00.180  280.0  560.0
 2011-01-01 00:00:00.240  340.0  680.0

So, should I delete rows from the larger source instead of interpolating? And if I do interpolate, how can I avoid losing data?

nmnir
    Hi, please see [How to ask](https://stackoverflow.com/help/how-to-ask) and [How to create a MCVE](https://stackoverflow.com/help/mcve). For `pandas`, see [How to ask a good pandas question](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples). – Evan Jun 01 '19 at 03:49
  • Edited; posted with test data instead of the original data. – nmnir Jun 01 '19 at 08:00
  • Like this? https://stackoverflow.com/questions/35918248/keep-original-data-points-when-padding-a-signal-with-pandas – Evan Jun 02 '19 at 21:53
  • @Evan I'm not sure that works as needed. According to https://datascience.stackexchange.com/questions/25924/difference-between-interpolate-and-fillna-in-pandas, fillna cannot receive a function as a parameter. So if there are 3 missing values between 100 and 200, I can't fill them to get 100, 125, 150, 175, 200 – nmnir Jun 05 '19 at 07:09
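For what it's worth, plain `interpolate()` (as opposed to `fillna`) does produce exactly that evenly spaced fill between known endpoints; a quick check with a positional Series:

```python
import pandas as pd

# interpolate() fits values linearly between the known endpoints, so three
# gaps between 100 and 200 become evenly spaced steps of 25.
s = pd.Series([100, None, None, None, 200])
print(s.interpolate().tolist())  # [100.0, 125.0, 150.0, 175.0, 200.0]
```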

0 Answers