13

I'm using Pandas to store stock prices data using Data Frames. There are 2940 rows in the dataset. The Dataset snapshot is displayed below:

enter image description here

The time series data does not contain the values for Saturday and Sunday. Hence missing values have to be filled.
Here is the code I've written but it is not solving the problem:

import pandas as pd
import numpy as np
import os
os.chdir('C:/Users/Admin/Analytics/stock-prices')

data  = pd.read_csv('stock-data.csv')

# PriceDate Column - Does not contain Saturday and Sunday stock entries
data['PriceDate'] =  pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_index(by=['PriceDate'], ascending=[True])


# Starting date is Aug 25 2004
idx = pd.date_range('08-25-2004',periods=2940,freq='D')


data = data.set_index(idx)
data['newdate']=data.index
newdate=data['newdate'].values   # Create a time series column   


data = pd.merge(newdate, data, on='PriceDate', how='outer')

How to fill the missing values for Saturday and Sunday?

User456898
  • 5,704
  • 5
  • 21
  • 37

1 Answers1

27

I think you can use resample with ffill or bfill, but before set_index from column PriceDate:

print (data)
   ID  PriceDate  OpenPrice  HighPrice
0   1  6/24/2016          1          2
1   2  6/23/2016          3          4
2   2  6/22/2016          5          6
3   2  6/21/2016          7          8
4   2  6/20/2016          9         10
5   2  6/17/2016         11         12
6   2  6/16/2016         13         14
data['PriceDate'] =  pd.to_datetime(data['PriceDate'], format='%m/%d/%Y')
data = data.sort_values(by=['PriceDate'], ascending=[True])
data.set_index('PriceDate', inplace=True)
print (data)
            ID  OpenPrice  HighPrice
PriceDate                           
2016-06-16   2         13         14
2016-06-17   2         11         12
2016-06-20   2          9         10
2016-06-21   2          7          8
2016-06-22   2          5          6
2016-06-23   2          3          4
2016-06-24   1          1          2

data = data.resample('D').ffill().reset_index()
print (data)
   PriceDate  ID  OpenPrice  HighPrice
0 2016-06-16   2         13         14
1 2016-06-17   2         11         12
2 2016-06-18   2         11         12
3 2016-06-19   2         11         12
4 2016-06-20   2          9         10
5 2016-06-21   2          7          8
6 2016-06-22   2          5          6
7 2016-06-23   2          3          4
8 2016-06-24   1          1          2

data = data.resample('D').bfill().reset_index()
print (data)
   PriceDate  ID  OpenPrice  HighPrice
0 2016-06-16   2         13         14
1 2016-06-17   2         11         12
2 2016-06-18   2          9         10
3 2016-06-19   2          9         10
4 2016-06-20   2          9         10
5 2016-06-21   2          7          8
6 2016-06-22   2          5          6
7 2016-06-23   2          3          4
8 2016-06-24   1          1          2
jezrael
  • 822,522
  • 95
  • 1,334
  • 1,252
  • 1
    When using bfill( ).reset_index( ), the following TypeError is displayed: Only valid with DatetimeIndex, TimedeltaIndex or PeriodIndex, but got an instance of 'RangeIndex' – User456898 Jul 14 '16 at 03:02
  • 2
    You need set index from column `PriceDate` - `data.set_index('PriceDate', inplace=True)`. – jezrael Jul 14 '16 at 04:26
  • 2
    I am not sure if understand correctly - do you need set new column - `data['new'] = data['PriceDate']` ? – jezrael Jul 14 '16 at 11:24
  • No, I actually got the solution. Wanted the corresponding dayname of data['PriceDate'] without repetition. This is not related to this particular question. Found the solution to print the daynames without repetition over here: http://stackoverflow.com/questions/30222533/create-a-day-of-week-column-in-a-pandas-dataframe-using-python – User456898 Jul 14 '16 at 11:50
  • @jezrael Any ideas on how to implement this when your data contains hundreds of different IDs? For example, I have time series data for 300 buildings and need to fill in time gaps for each as individual time series. I have written a function by hand which seems to work on small datasets but it's extremely slow. – Renel Chesak Feb 23 '18 at 03:31
  • @RenelChesak - You need `data = data.groupby('ID').resample('D').ffill().reset_index(level=0, drop=True).reset_index()` – jezrael Feb 23 '18 at 06:19