1

I have 31-day time series dataset which I extract using the below query.

import pymysql
import pymysql.cursors
import pandas as pd
import sys
import csv

conn = pymysql.connect(host='',user='',password='',db='',port='',charset='utf8mb4',cursorclass=pymysql.cursors.DictCursor)

dbquery = """SELECT * FROM 'table_name' where 'date' between ('2012-03-01 00:00:00') and ('2012-03-31 23:59:59') """

df = pd.read_sql_query(dbquery, conn)
df.to_csv('one_month_timeseries_data.csv', sep=',', encoding='utf-8')

The extracted dataframe is structured as below.

id date                  value
1  2012-01-01 00:00:00   33
5  2012-01-02 00:00:01   15
.
.
.
.
2  2012-01-30 23:59:58   10
5  2012-01-31 23:59:59   5

I want to further loop through each day (00:00:00 - 23:59:59) through the entire dataset and split the dataframe into 3 separate dataframes of 8 hour intervals each grouped into sleep (00:00:00 - 07:59:59), work (08:00:00 - 15:59:59), home (16:00:00 - 23:59:59).

Expected output:

e.g., for day 1:

day1_df_sleep:

id date                  value
1  2012-01-01 00:00:00   33
.
.
.
.
3  2012-01-01 07:59:59   10

day1_df_work:

id date                   value
1  2012-01-01 08:00:00    12
.
.
.
.
4  2012-01-01 15:59:59    50

day1_df_home:

id date                   value
1  2012-01-01 16:00:00    12
.
.
.
.
4  2012-01-01 23:59:59    50

etc..until day 31

I have read about pandas' date_range() function but I am not sure how best to implement it where I only have the date column.

How best can I implement this using pandas?

GKC
  • 447
  • 4
  • 10

1 Answers1

2

Use dt.day and dt.hour and create new dataframes by filtering by the different days and different times of the day. You cannot use between() as an alternative to .lt and .ge, because the lowerbound is inclusive while the upper bound is exclusive in your desired output. As a separate problem, you can loop through each day, and below is what you would do for each day, and you could set a variable for day when looping through. Looping through each day and creating dataframes for each day is obviously going to be a little bit more complicated.

I don't know the reason for creating separated dataframes by day, but why not just have a function with a varaible for day to return output, which you could call with a parameter for user input, writing it back to sql or excel, etc. without having to create separate dataframes.

df['date'] = pd.to_datetime(df['date'])
day1_df_sleep = df[df['date'].dt.day == 1 & df['date'].dt.hour.ge(0) & df['date'].dt.hour.lt(8)]
day1_df_work = df[df['date'].dt.day == 1 & df['date'].dt.hour.ge(8) & df['date'].dt.hour.lt(16)]
day1_df_home = df[df['date'].dt.day == 1 & df['date'].dt.hour.ge(16) & df['date'].dt.hour.lt(24)]

For example you could do something like this to loop:

df = d.copy()
df['date'] = pd.to_datetime(df['date'])
for i in range(1,31):
    day1_df_sleep = df[df['date'].dt.day == i & df['date'].dt.hour.ge(0) & df['date'].dt.hour.lt(8)]
    day1_df_work = df[df['date'].dt.day == i & df['date'].dt.hour.ge(8) & df['date'].dt.hour.lt(16)]
    day1_df_home = df[df['date'].dt.day == i & df['date'].dt.hour.ge(16) & df['date'].dt.hour.lt(24)]
    print(day1_df_sleep, day1_df_work, day1_df_home)
David Erickson
  • 16,433
  • 2
  • 19
  • 35
  • 1
    Thanks a lot David. It works partially. The problem is that it only keeps looping through the records for day 1 and does not continue to process the rest of the 30 days. – GKC Oct 07 '20 at 23:21
  • No problem @Gee it's a tougher solution to get the different dataframes. I know that there is a use case for many different dataframes, but I would envision a function with parameters that you could run in a for loop rather than all of the dataframes. Please check out the second answer with more votes here as well as the last answer: https://stackoverflow.com/questions/30635145/create-multiple-dataframes-in-loop – David Erickson Oct 07 '20 at 23:28
  • Thanks, and by the way the reason for creating separate dataframes is to compute several features per each interval of the day namely sleep, work, and home. – GKC Oct 07 '20 at 23:38
  • If use the **groupby** function, I get back the grouped dataframes for each of the 31 days; `df_grouped = df.groupby([df['date'].dt.date])` `for j, df_group in df_grouped:` `print (df_group)` But the challenge still remains on how to apply the for-loop that you suggested on each of the daily groups **(df_group)** in order to extract each day's dataframe intervals: sleep, work, and home. – GKC Oct 08 '20 at 13:09