0

I have a dataset below

2018-02-11 17:09:47.211344+05:30,,
2018-02-11 17:10:13.553385+05:30,,
2018-02-11 17:30:13.58636+05:30,,
2018-02-11 18:00:13.630383+05:30,,
2018-02-11 18:30:13.558238+05:30,,
2018-02-12 03:50:19.298678+05:30,,
2018-02-12 04:53:17.187277+05:30,,
2018-02-12 05:10:25.443962+05:30,,
2018-02-12 05:20:21.591291+05:30,,
2018-02-13 06:41:54.234258+05:30,,
2018-02-17 07:04:10.662735+05:30,,
2018-02-20 05:34:39.855528+05:30,,

I want to select a group of entries between two specific dates, for example between 2018-02-11 and 2018-02-17.Now I followed this stackoverflow question Select dataframe rows between two dates and it seems to work except it doesn't return all the entries for the specified range.This is what I did as suggested in the answers

import pandas as pd
import numpy as np

df = pd.read_csv("file.csv", header=None,
                 names=["date", "entry", "exit"], parse_dates=["date"])
df.set_index("date", inplace=True)
#df = df.set_index(['date'])
df.fillna(0, inplace=True)
print(df.loc['2018-02-11': '2018-02-17'])

This is the result below

                               entry    exit
date                                        
2018-02-11 11:39:47.211344     0.0       0.0
2018-02-11 11:40:13.553385     0.0       0.0
2018-02-11 12:00:13.586360     0.0       0.0
2018-02-11 12:30:13.630383     0.0       0.0
2018-02-11 13:00:13.558238     0.0       0.0
2018-02-11 22:20:19.298678     0.0       0.0
2018-02-11 23:23:17.187277     0.0       0.0
2018-02-11 23:40:25.443962     0.0       0.0
2018-02-11 23:50:21.591291     0.0       0.0
2018-02-13 01:11:54.234258     0.0       0.0
2018-02-17 01:34:10.662735     0.0       0.0

As you can see, I don't get the entries for the date 2018-02-12.Why is it leaving this out?

I even tried another method

print(df[(df.index > '2018-02-11') & (df.index <= '2018-02-17')])

But I still get the same result.So what is going wrong?

Souvik Ray
  • 2,899
  • 5
  • 38
  • 70
  • It's not leaving that out. The dates are rounded. The number of rows says it all. – Bharath M Shetty Apr 12 '18 at 16:53
  • 3
    It is your `+05:30` that is shifting your entire `2018-02-12` back to `2018-02-11` – piRSquared Apr 12 '18 at 16:58
  • 1
    Look at your first date in result df `2018-02-11 11:39:47.211344`. Add `+5:30` and you get `2018-02-11 17:09:47.211344` , your original row. All the dates for `2018-02-12` are converted to `2018-02-11` since `+5:30` is not counted in result. In gist, code works as expected. You have to interpret the results correctly. – harvpan Apr 12 '18 at 17:21
  • @piRSquared thanks for pointing that out! – Souvik Ray Apr 12 '18 at 17:33

1 Answers1

1

Your input file has a time zone offset of +05:30 at the end of each datetime string. Pandas is automatically applying that offset on import, but produces timezone-naïve datetime objects.

One fix (if you actually need timezone-aware datetimes...) is to localize to UTC and then re-apply the +05:30 offset by converting to a timezone that carries that offset, such as Asia/Kolkata:

df = df.tz_localize('UTC').tz_convert('Asia/Kolkata')
df.loc['2018-02-11': '2018-02-17']
                                   v1   v2
date                                      
2018-02-11 17:09:47.211344+05:30  0.0  0.0
2018-02-11 17:10:13.553385+05:30  0.0  0.0
2018-02-11 17:30:13.586360+05:30  0.0  0.0
2018-02-11 18:00:13.630383+05:30  0.0  0.0
2018-02-11 18:30:13.558238+05:30  0.0  0.0
2018-02-12 03:50:19.298678+05:30  0.0  0.0
2018-02-12 04:53:17.187277+05:30  0.0  0.0
2018-02-12 05:10:25.443962+05:30  0.0  0.0
2018-02-12 05:20:21.591291+05:30  0.0  0.0
2018-02-13 06:41:54.234258+05:30  0.0  0.0
2018-02-17 07:04:10.662735+05:30  0.0  0.0
Peter Leimbigler
  • 10,775
  • 1
  • 23
  • 37