split the date range into multiple ranges

Question

I have data in CSV like this:

1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue

I want to extract the rows dated from 1 October of each year to 31 March of the following year, across all the data. For the data above, the output should be:

1940/1941:

1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue

1941/1942:

1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue
1942-10-01,somevalue

My attempt so far:

import csv
from datetime import datetime

with open('data.csv','r') as f:
    data = list(csv.reader(f))

quaters = []
year =  datetime.strptime(data[0][0], '%Y-%m-%d').year
for each in data:
    date =  datetime.strptime(each[0], '%Y-%m-%d')
    print(each)        

    if (date>=datetime(year=date.year,month=10,day=1) and date<=datetime(year=date.year+1,month=3,day=31)):
        middle_quaters[-1].append(each)
    if year != date.year:            
        quaters.append([])

But I am not getting the expected output. I want to store each range of dates in a separate list.

Doesn't your sample result contain an error? Why does the result for 1941/42 have a record from year 1940? See my answer for the correct output. – Kaushal28, Nov 01 '19 at 20:04

Yatish Kadam · Answer 1 · 2019-11-01T19:55:24.443

0

I would use a pandas DataFrame for this; it would be easier. See "Pandas: Selecting DataFrame rows between two dates (Datetime Index)".

So, for your case:

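import pandas as pd

# data.csv has no header row: name the columns and index the frame by date
df = pd.read_csv("data.csv", header=None, names=["date", "value"],
                 parse_dates=["date"], index_col="date").sort_index()

# example bounds taken from the question (1 Oct to 31 Mar)
startDate, endDate = "1940-10-01", "1941-03-31"
print(df.loc[startDate : endDate])

# you can walk through a bunch of ranges like so
listOfDateRanges = [("1940-10-01", "1941-03-31"), ("1941-10-01", "1942-03-31")]
for date_range in listOfDateRanges:
    print(df.loc[date_range[0] : date_range[1]])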
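
The (start, end) tuples do not have to be typed by hand. As a rough sketch, assuming the first and last starting years are known (or read from the data), they could be generated like this. The helper name `winter_ranges` is made up for illustration and is not part of pandas:

```python
from datetime import date

def winter_ranges(first_year, last_year):
    """Build (start, end) pairs running from 1 Oct of each year to 31 Mar of the next."""
    return [
        (date(y, 10, 1), date(y + 1, 3, 31))
        for y in range(first_year, last_year + 1)
    ]

# Starting years 1940 and 1941 match the sample data in the question
ranges = winter_ranges(1940, 1941)
for start, end in ranges:
    print(start.isoformat(), "to", end.isoformat())
```

Each pair can then be used as the two bounds of a date slice, one range at a time.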

edited Nov 01 '19 at 19:55

answered Nov 01 '19 at 19:46


But my date range changes; it can fall in any year, so I can't hardcode it. – Ayyan Khan Nov 01 '19 at 19:51
Where are you hardcoding the values? – Yatish Kadam Nov 01 '19 at 19:51
Your startDate and endDate can be anything you want. Put them in a list as tuples and walk through the list to get the required dates. – Yatish Kadam Nov 01 '19 at 19:52
@Kaushal28 What do you mean? It's basically a filter argument you are passing. – Yatish Kadam Nov 04 '19 at 15:12

Kaushal28 · Answer 2 · 2019-11-01T20:00:18.573

You can use the pandas library for this. Here is some sample code:

import pandas as pd
# assumes so.csv has a header row naming the columns timestamp and value
df = pd.read_csv('so.csv', parse_dates=['timestamp'])   # timestamp is your time column
current_year, next_year = 1940, 1941
df = df.query(f'(timestamp >= "{current_year}-10-01") & (timestamp <= "{next_year}-03-31")')
print(df)

This gives the following result on your data:

   timestamp      value
0 1940-10-01  somevalue
1 1940-11-02  somevalue
2 1940-11-03  somevalue
3 1940-11-04  somevalue
4 1940-12-05  somevalue
5 1940-12-06  somevalue
6 1941-01-07  somevalue
7 1941-02-08  somevalue
8 1941-03-09  somevalue
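
To cover every October-March range in one pass rather than one pair of years at a time, a possible sketch is to label each row with the year in which its range starts and group on that label. The inline sample, the shortened values, and the names `season_start` and `seasons` are assumptions made for illustration; the header row is assumed as above.

```python
import io
import pandas as pd

# Inline sample standing in for the CSV file (header row assumed)
csv_text = """timestamp,value
1940-10-01,a
1940-11-02,b
1941-03-09,c
1941-05-01,d
1941-10-04,e
1942-03-09,f
"""
df = pd.read_csv(io.StringIO(csv_text), parse_dates=["timestamp"])

month = df["timestamp"].dt.month
year = df["timestamp"].dt.year

# Keep only October-March rows; April-September lies outside every range
in_season = (month >= 10) | (month <= 3)

# January-March rows belong to the range that started the previous October
season_start = year.where(month >= 10, year - 1)

seasons = {
    f"{start}/{start + 1}": group
    for start, group in df[in_season].groupby(season_start[in_season])
}

for label, group in seasons.items():
    print(label)
    print(group.to_string(index=False))
```

Here `seasons` maps a label such as "1940/1941" to the sub-frame for that range, so no years need to be hardcoded.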

Hope this helps!

score 0 · Answer 3 · answered Nov 01 '19 at 19:59

Without external packages: build a lookup keyed on the date field, convert each key to an int, and use less-than and greater-than comparisons to establish the range.

import re

data = '''1940-10-01,somevalue
1940-11-02,somevalue
1940-11-03,somevalue
1940-11-04,somevalue
1940-12-05,somevalue
1940-12-06,somevalue
1941-01-07,somevalue
1941-02-08,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-06-02,somevalue
1941-07-03,somevalue
1941-10-04,somevalue
1941-12-05,somevalue
1941-12-06,somevalue
1942-01-07,somevalue
1942-02-08,somevalue
1942-03-09,somevalue'''

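# Map each date, rewritten as a YYYYMMDD string, to its original line
lookup = {}
lines = data.split('\n')
for line in lines:
    d = re.sub(r'-', '', line.split(',')[0])
    lookup[d] = line

dates = sorted(lookup.keys())

# Range bounds as YYYYMMDD integers (both bounds are exclusive)
_in = 19401201
out = 19411004
outfile = []
for date in dates:
    if _in < int(date) < out:
        outfile.append(lookup[date])

# Print each matching line, not the whole list
for l in outfile:
    print(l)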
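
Still with the standard library only, the rows could also be grouped into one list per October-March range instead of filtering a single range. This is a minimal sketch, not a tuned implementation: the function name `group_by_season` is hypothetical, and `io.StringIO` stands in for an opened CSV file.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

sample = """1940-10-01,somevalue
1940-11-02,somevalue
1941-03-09,somevalue
1941-05-01,somevalue
1941-10-04,somevalue
1942-03-09,somevalue
"""

def group_by_season(rows):
    """Group rows dated 1 Oct to 31 Mar by the year in which the range starts."""
    seasons = defaultdict(list)
    for row in rows:
        d = datetime.strptime(row[0], "%Y-%m-%d")
        if d.month >= 10:
            seasons[d.year].append(row)      # Oct-Dec: range starts this year
        elif d.month <= 3:
            seasons[d.year - 1].append(row)  # Jan-Mar: range started last year
        # April-September rows are skipped
    return dict(seasons)

# io.StringIO stands in for open('data.csv') here
groups = group_by_season(csv.reader(io.StringIO(sample)))
for start in sorted(groups):
    print(f"{start}/{start + 1}:")
    for row in groups[start]:
        print(",".join(row))
```

Because the range is decided from the month alone, the same code handles any number of years without listing them.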

What if the input is stored in a file? Converting the `csv` to a string first and then applying integer comparisons to determine the date ranges is not an efficient approach. – Kaushal28, Nov 01 '19 at 20:05
