0

My timeseries dates are being jumbled (day/month) when I assign them as a datetimeindex. Sees odd that parser could get it so wrong, but have tried declaring format and using Dayfirst but nothing working.

#input_data = pd.read_csv(url)
input_data = pd.read_csv(url,usecols=['Dates','TYAFWD Comdty'],skiprows=None, parse_dates=True, nrows=1500)

# Set Date as Index, clean dataframe
input_data = input_data.set_index('Dates')
df = pd.DataFrame(input_data).dropna()
print(df.columns)

# Create new Date index
data_time = pd.to_datetime(df.index)
datetime_index = pd.DatetimeIndex(data_time.values) 
df = df.set_index(datetime_index)
df.index = pd.to_datetime(df.index, infer_datetime_format='%Y/%m/%d' )


df['year'] = pd.DatetimeIndex(df.index).year
df['month'] = pd.DatetimeIndex(df.index).month
df['week'] = pd.DatetimeIndex(df.index).weekofyear
print(df.head(30))

Can see from the output that it is all mixed up. I would expect all the entries in the output to be in May, the 5th month, but it is flipping the dates once <12

Here is my raw data: https://raw.githubusercontent.com/esheehan1/projects/master/BB_FUT_DATA.csv

Index(['TYAFWD Comdty'], dtype='object')
            TYAFWD Comdty  year  month  week
2020-05-26          0.508  2020      5    22
2020-05-25          0.494  2020      5    22
2020-05-22          0.494  2020      5    21
2020-05-21          0.508  2020      5    21
2020-05-20          0.512  2020      5    21
2020-05-19          0.512  2020      5    21
2020-05-18          0.552  2020      5    21
2020-05-15          0.483  2020      5    20
2020-05-14          0.474  2020      5    20
2020-05-13          0.494  2020      5    20
2020-12-05          0.510  2020     12    49
2020-11-05          0.548  2020     11    45
2020-08-05          0.527  2020      8    32
2020-07-05          0.494  2020      7    27
2020-06-05          0.568  2020      6    23
2020-05-05          0.541  2020      5    19
ES Kepler
  • 15
  • 3

3 Answers3

1

It's always a good idea to take at look at a bit of your raw data before writing code to operate on it (if you can).

In your particular case the date format is D/M/Y which is the International and European standard. The function pd.read_csv uses American date format M/D/Y by default.

Changing this using the parameter dayfirst=True will give your desired output. Also, I've shortened your code a bit:

import pandas as pd
from datetime import date
url = 'https://raw.githubusercontent.com/esheehan1/projects/master/BB_FUT_DATA.csv'
df = pd.read_csv(url, usecols=['Dates','TYAFWD Comdty'], index_col=['Dates'], skiprows=None, parse_dates=True, dayfirst=True, nrows=1500)
print(df.iloc[15:20,:])

            TYAFWD Comdty
Dates                    
2020-05-05          0.541
2020-05-04          0.527
2020-05-01          0.512
2020-04-30          0.528
2020-04-29          0.521

Add your desired columns:

df['year']  = pd.to_datetime(df.index).year
df['month'] = pd.to_datetime(df.index).month
df['week']  = pd.to_datetime(df.index).weekofyear
print(df.iloc[15:20,:])

            TYAFWD Comdty  year  month  week
Dates                                       
2020-05-05          0.541  2020      5    19
2020-05-04          0.527  2020      5    19
2020-05-01          0.512  2020      5    18
2020-04-30          0.528  2020      4    18
2020-04-29          0.521  2020      4    18

Check out the pandas documentation for pd.read_csv there are many parameters you may find useful!

eNc
  • 1,021
  • 10
  • 23
0

I can't see anything wrong with your output.

Seems like the default behavior of .to_datetime(). It fomats by default in descending order year, month, day. It's the "standard".

However, if you want to make sure the data is transformed properly, use the argument format;

df.index = df.index.to_datetime(format='%d/%m/%Y')
# that's it
Sy Ker
  • 2,047
  • 1
  • 4
  • 20
0

The default datetime formatting in pd.read_csv causes your problem, since it assumes that /-separated formatting is %m/%d/%Y. I'd also suggest you simplify your code a bit, since at the moment there are a lot of unnecessary casting operations:

import pandas as pd

# Result is a DataFrame already
df = pd.read_csv('BB_FUT_DATA.csv', usecols=['Dates', 'TYAFWD Comdty'], skiprows=None, nrows=1500)
df.dropna(inplace=True)
df.Dates = pd.to_datetime(df.Dates, format='%d/%m/%Y')
df.set_index('Dates', inplace=True)
# Since df.index is already of type datetime you can access the year, month, weekofyear attributes directly
df['year'] = df.index.year
df['month'] = df.index.month
df['week'] = df.index.weekofyear
print(df.head(30))

Alternatively you can do this completely in pd.read_csv using dayfirst=True (as @eNc pointed out) or date_parser=lambda x: pd.to_datetime(x, format='%d/%m/%Y') and na_filter for dropping rows with NaN and NA values:

import pandas as pd


df = pd.read_csv(
    'BB_FUT_DATA.csv',
    usecols=['Dates', 'TYAFWD Comdty'],
    parse_dates=True,
    dayfirst=True,
    skiprows=None,
    nrows=1500,
    index_col='Dates',
    na_filter=True
    )
df['year'] = df.index.year
df['month'] = df.index.month
df['week'] = df.index.weekofyear
print(df.head(30))

Output:

            TYAFWD Comdty  year  month  week
Dates                                       
2020-05-26          0.508  2020      5    22
2020-05-25          0.494  2020      5    22
2020-05-22          0.494  2020      5    21
2020-05-21          0.508  2020      5    21
2020-05-20          0.512  2020      5    21
2020-05-19          0.512  2020      5    21
2020-05-18          0.552  2020      5    21
2020-05-15          0.483  2020      5    20
2020-05-14          0.474  2020      5    20
2020-05-13          0.494  2020      5    20
2020-05-12          0.510  2020      5    20
2020-05-11          0.548  2020      5    20
2020-05-08          0.527  2020      5    19
2020-05-07          0.494  2020      5    19
2020-05-06          0.568  2020      5    19
2020-05-05          0.541  2020      5    19
2020-05-04          0.527  2020      5    19
2020-05-01          0.512  2020      5    18
2020-04-30          0.528  2020      4    18
2020-04-29          0.521  2020      4    18
2020-04-28          0.519  2020      4    18
2020-04-27          0.559  2020      4    18
2020-04-24          0.518  2020      4    17
2020-04-23          0.512  2020      4    17
2020-04-22          0.514  2020      4    17
2020-04-21          0.474  2020      4    17
2020-04-20          0.490  2020      4    17
2020-04-17          0.521  2020      4    16
2020-04-16          0.510  2020      4    16
2020-04-15          0.498  2020      4    16
kampmani
  • 680
  • 5
  • 13