0

I get csv files containing share price data. A sample is as below.

'dttm','open','high','low','close'
"2014/01/01 09:16:00",6365.2001953125,6369.89990234375,6355,6355,0
"2014/01/01 09:17:00",6355.64990234375,6359.9501953125,6355.5498046875,6359.5498046875,0
"2014/01/01 09:18:00",6359.5,6359.7998046875,6358,6359,0
"2014/01/01 09:19:00",6358.9501953125,6359.4501953125,6357.5498046875,6359,0
"2014/01/01 09:20:00",6359,6359,6355.64990234375,6356.5,0
.....likewise till "2014/01/01 15:30:30"  (and for further dates ahead)

Every row contains data for a minute.

Problem:-
Sometimes, a minute data is skipped. e.g. row for "2014/01/01 09:18:00" would not be present.
This hampers my program logic ahead.

What I require:-
is to validate the csv file whether rows for every 1 minute are present between 09:15:15 to 15:30:30 for every date. If not, copy previous row and insert for that minute (which is not present).

Can anybody pl. help?
Thanks.

Vineet
  • 624
  • 1
  • 11
  • 26

2 Answers2

1

you could basically read two consecutive lines and get time delta. If it is not 1 minute, means you're missing a line. Just write to the csv with a new line character at the end! you could also write everything to a new CSV file.

import csv
import datetime
f = open("your_file.csv", "w+")
ff = csv.reader(f)    
pre_line = ff.next()
while(True):
    try:
        cur_line = ff.next()
        if cur_line - pre_line != # 1 minute difference:
            f.write(pre_line)
            f.write('/n')
    except:
        break
SamAtWork
  • 455
  • 5
  • 17
1

Here is sample code you can use:

from dateutil.parser import parse
from datetime import datetime, timedelta


data = [
    ("2014/01/01 09:16:00",6365.2001953125,6369.89990234375,6355,6355,0),
    ("2014/01/01 09:17:00",6355.64990234375,6359.9501953125,6355.5498046875,6359.5498046875,0),
    ("2014/01/01 09:20:00",6359,6359,6355.64990234375,6356.5,0),
]


def insert_into_db(date, open, high, low, close, zero):
    print('inserting {} {} {} {} {} {}'.format(date, open, high, low, close, zero))

prev_date = None
for date, open, high, low, close, zero in data:
    date = parse(date)

    if prev_date is not None and date - prev_date > timedelta(minutes=1):
        for i in reversed(range((date - prev_date).seconds // 60 - 1)):
            date_between = date - timedelta(minutes=1 * i + 1)
            insert_into_db(date_between, open, high, low, close, zero)

    insert_into_db(date, open, high, low, close, zero)
    prev_date = date

The output is:

inserting 2014-01-01 09:16:00 6365.2001953125 6369.89990234375 6355 6355 0
inserting 2014-01-01 09:17:00 6355.64990234375 6359.9501953125 6355.5498046875 6359.5498046875 0
inserting 2014-01-01 09:18:00 6358.9501953125 6359.4501953125 6357.5498046875 6359 0
inserting 2014-01-01 09:19:00 6358.9501953125 6359.4501953125 6357.5498046875 6359 0
inserting 2014-01-01 09:20:00 6359 6359 6355.64990234375 6356.5 0

But you should make sure start and end minutes present (or adapt the code).

Update: fixed case if multiple minutes missing

donnyyy
  • 452
  • 3
  • 11
  • I think the for loop should be nested. In the code "for date, open, high, low, close, zero in data:", 'data' is a list of tuples which would require further processing. – Vineet Jun 27 '18 at 15:04
  • 1
    What do you mean? In each iteration for loop receives a tuple of ``data`` and unpacks it into ``date, open, high, low, close, zero``, isn't it? – donnyyy Jun 27 '18 at 15:14
  • Oh. Sorry. I thought i need to do it in two shots. First 'for loop' to get tuples and another 'for loop' to parse its items. But clarified my concept now. Thanks. – Vineet Jun 27 '18 at 15:17
  • Another point:- this code deals with data for one date (not talking about datetime, but data for one day). There would be data for multiple days in same file. When the code encounters next day, it would insert hundreds of rows because of the time difference. How do I avoid it? – Vineet Jun 27 '18 at 15:25
  • 1
    You can set a threshold to avoid tons of insertions: change if condition to ``prev_date is not None and date - prev_date > timedelta(minutes=1) and date - prev_date < timedelta(...)``. Or you can compare days: ``prev_date is not None and date - prev_date > timedelta(minutes=1) and date.day == prev_date.day`` – donnyyy Jun 27 '18 at 15:33
  • thanks. It works great. I will try to modify it to update it into the csv file; and not into separate list object. – Vineet Jun 27 '18 at 15:42