4

I'm using Python 2.7, panda 0.14.1-2, numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3

I'm trying to analyze a csv files that outputs Month, Day and Hour in separate columns, and would look something like the following:

Month Day Hour Value 1 1 1 105 1 1 2 30 1 1 3 85 1 1 4 52 1 1 5 65

I basically want to create a timestamp from those columns, and use "2005" as the year, and set this new timestamp column to be the index. I've read a lot of similar questions (here and here) but they all rely on doing during read_csv(). I don't have a year column, so I don't think this applies to me (aside from loading dataframe, inserting column, writing, and redoing read_csv... seems convoluted).

After loading the dataframe, I insert a Year column in position 0 df.insert(0, "Year", 2005)

So now I've got

Year Month Day Hour Value 2005 1 1 1 105 2005 1 1 2 30 2005 1 1 3 85 2005 1 1 4 52 2005 1 1 5 65 df.types tells me that all columns are int64 types.

Then I've tried doing this:

df['Datetime'] = pd.to_datetime(df.Year*1000000 + df.Month*10000 + df.Day+100 + df.Hour, format="%Y%M%d%H")

But I'm getting "TypeError: 'long' object is unsliceable"

On the other hand, the following runs without errors.

df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

As 2.7 doesn't like the %Y%M%d%H, as pointed by @EdChum, I've tried doing it in two steps: creating a datetime without hours, and adding the hours after. But: the output is not what I wanted

In [1]: # Do it without hours first (otherwise doesn't work in Python 2.7)
df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")

In [2]: df['Datetime']
Out [2]:
0    2005-01-01 00:01:00
1    2005-01-01 00:01:00
...
13   2005-01-01 00:01:00
14   2005-01-01 00:01:00
...
8745   2005-01-31 00:12:00
8746   2005-01-31 00:12:00
...
8758   2005-01-31 00:12:00
8759   2005-01-31 00:12:00

8758 is supposed to be 2005-12-31 for example. What is wrong with that?

Once I resolve that, I'll be able to re-add the hours:

In [3]: # Then add the hours
df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")
Community
  • 1
  • 1
Julien Marrec
  • 11,605
  • 4
  • 46
  • 63
  • 1
    What version pandas, numpy and python are you using? It works for me on pandas 0.14.1, numpy 1.8.1 and python 3.3.5 64-bit – EdChum Oct 01 '14 at 09:18
  • Works for me. Pandas 0.14.1, numpy 1.9.0, Python 3.4.0 64bit. – Ffisegydd Oct 01 '14 at 09:21
  • Python 2.7, panda 0.14.1-2, numpy 1.8.1-1. I have to use Python 2.7 because I'm coupling it with something that doesn't work on Python 3.x – Julien Marrec Oct 01 '14 at 09:24
  • If this is a bug in python 2.7 one workaround would be to do the conversion without hour and then after the conversion add the hour, could you try this – EdChum Oct 01 '14 at 09:27
  • Can you explain how do I add my integer 'Hour' time series to my 'Datetime' datetime64 series please? – Julien Marrec Oct 01 '14 at 09:34
  • You should be able to do something like this `df.index = df.index + pd.to_timedelta(df['Hour'] + ':00')` I think – EdChum Oct 01 '14 at 09:47
  • "TypeError: unsupported operand type(s) for +: 'numpy.ndarray' and 'str'" – Julien Marrec Oct 01 '14 at 09:56
  • `df['Datetime'] = df['Datetime'] + pd.to_timedelta(df['Hour'], unit="h")` seems to work though. I didn't realize before, but when I do `df['Datetime'] = pd.to_datetime(df.Year*10000 + df.Month*100 + df.Day, format="%Y%M%d")`, I've got minutes that change. Like it's 00:01:00 for the first days, and it's 00:12:00 for the last (365th) day. – Julien Marrec Oct 01 '14 at 09:58
  • @EdChum it seems there's a problem. I've updated the question. – Julien Marrec Oct 01 '14 at 11:10

2 Answers2

20

Letting the pandas parser do the heavy lifting (as in first answer) is obviously the best option if you are getting it from csv. If you are getting or calculating numbers in a different way try:

df['DateTime'] = df[['Year', 'Month', 'Day', 'Hour']].apply(lambda s : datetime.datetime(*s),axis = 1)

find that is still easy to read and very flexible.

Joop
  • 7,840
  • 9
  • 43
  • 58
  • Thanks, but I'm not sure I understand the lambda portion. Anyway, i'm getting "ValueError: ('hour must be in 0..23', u'occurred at index 23')". My "0" is actually "24" in the dataset. Is there an easy way to fix this? – Julien Marrec Oct 01 '14 at 14:18
  • `df['Hour'] = df['Hour'].replace(24,0)` before your snippet worked great! Thanks a lot – Julien Marrec Oct 01 '14 at 14:22
  • Arg. The problem is that now I have midnight not correct. 2005-12-30 00:00:00 comes after 2005-12-30 23:00:00, which isn't what it's supposed to do! But I guess the problem is my dataset... – Julien Marrec Oct 01 '14 at 14:27
  • 1
    datetime.datetime takes arguments (year, month, day, etc.) the just unpacks the columns that you pass so that it calculates individually. credit to one of pandas developers (think it was Andy Haydn), got code from his answer while back, can't find link to it now. I use same structure to be able to do excel like table formulas in DataFrame. – Joop Oct 01 '14 at 14:28
  • If woring with hourly data can subtract one minute and then roundTime function. might have to do a bit more manipulation to clean things up – Joop Oct 01 '14 at 14:33
  • What I like about this solution is that `df[[]].apply(lambda ...)` can be used for any function that doesn't intrinsically support pandas broadcasting (e.g. `datetime`). – fearless_fool Mar 14 '21 at 20:40
11

You could parse the input text in your question using pandas.read_csv():

#!/usr/bin/env python
from datetime import datetime
import pandas as pd

print(pd.read_csv(
    'input.txt', sep=r'\s+', parse_dates=[[0, 1, 2]],
    date_parser=lambda *columns: datetime(2005, *map(int, columns)),
    index_col=0))

Output

                     Value
Month_Day_Hour            
2005-01-01 01:00:00    105
2005-01-01 02:00:00     30
2005-01-01 03:00:00     85
2005-01-01 04:00:00     52
2005-01-01 05:00:00     65
jfs
  • 399,953
  • 195
  • 994
  • 1,670
  • Thanks for the answer. But I'm getting and error. "ValueError: month must be in 1..12". Ideas? – Julien Marrec Oct 01 '14 at 14:15
  • it means that month must be in 1..12 (including) range e.g., `0` is invalid. – jfs Oct 01 '14 at 15:05
  • My month column is in 1...12. It doesn't work when I'm using EXACTLY the data I pasted in my question. Are you using Python 3.X? – Julien Marrec Oct 03 '14 at 11:05
  • It works with the data from your question (how do you think I've produced the output in the answer). The code works on both Python 2 and 3 without any changes. `pandas` is installed using `sudo apt-get install python{,3}-pandas` – jfs Oct 03 '14 at 11:10
  • My bad, it does work. I don't know what I was doing wrong, but now it's fine. I'll accept this answer because it does it in one step while loading, so it's probably faster. Thanks for the help! – Julien Marrec Oct 03 '14 at 11:20