6

I encountered this behaviour when doing basic data munging, like in this example:

In [55]: import pandas as pd
In [56]: import numpy as np
In [57]: rng = pd.date_range('1/1/2000', periods=10, freq='4h')
In [58]: lvls = ['A','A','A','B','B','B','C','C','C','C']
In [59]: df = pd.DataFrame({'TS': rng, 'V' : np.random.randn(len(rng)), 'L' : lvls})

In [60]: df
Out[60]: 
   L                  TS         V
0  A 2000-01-01 00:00:00 -1.152371
1  A 2000-01-01 04:00:00 -2.035737
2  A 2000-01-01 08:00:00 -0.493008
3  B 2000-01-01 12:00:00 -0.279055
4  B 2000-01-01 16:00:00 -0.132386
5  B 2000-01-01 20:00:00  0.584091
6  C 2000-01-02 00:00:00 -0.297270
7  C 2000-01-02 04:00:00 -0.949525
8  C 2000-01-02 08:00:00  0.517305
9  C 2000-01-02 12:00:00 -1.142195

the problem:

In [61]: df['TS'].min()
Out[61]: 31969-04-01 00:00:00

In [62]: df['TS'].max()
Out[62]: 31973-05-10 00:00:00

while this looks ok:

In [63]: df['V'].max()
Out[63]: 0.58409076701429163

In [64]: min(df['TS'])
Out[64]: <Timestamp: 2000-01-01 00:00:00>

when aggregating after groupby:

In [65]: df.groupby('L').min()
Out[65]: 
             TS         V
L                        
A  9.466848e+17 -2.035737
B  9.467280e+17 -0.279055
C  9.467712e+17 -1.142195

In [81]: val = df.groupby('L').agg('min')['TS']['A']
In [82]: type(val)
Out[82]: numpy.float64

Apparently in this particular case it has something to do with using frequency datetime index as argument of pd.Series function:

In [76]: rng.min()
Out[76]: <Timestamp: 2000-01-01 00:00:00>

In [77]: ts = pd.Series(rng)
In [78]: ts.min()
Out[78]: 31969-04-01 00:00:00

In [79]: type(ts.min())
Out[79]: numpy.datetime64

However, my initial problem was with min/max of Timestamp series parsed from strings via pd.read_csv()

What am I doing wrong?

tmthydvnprt
  • 10,398
  • 8
  • 52
  • 72
LukaszJ
  • 145
  • 2
  • 6
  • 1
    What version of Numpy are you using? There are problems with the representation of timestamps in 1.6. – meteore Oct 19 '12 at 12:55
  • It is all under Numpy 1.6.2 and Pandas 0.9.0, maintained via MacPorts. Following your hint, I have found [similar problem reports](https://groups.google.com/forum/#!msg/pystatsmodels/n1oBBVYI5FQ/DwuvAOc32yAJ), just trying to figure out how they relate to my situation. – LukaszJ Oct 19 '12 at 13:16
  • BTW, Meteore, were you (or anyone else) able to replicate this behaviour? – LukaszJ Oct 19 '12 at 14:16

1 Answers1

6

As @meteore points out, it's a problem with the string repr of the np.datetime64 type in NumPy 1.6.x. The underlying data, should still be correct. To workaround this problem, you can do something like:

In [15]: df
Out[15]: 
   L                  TS         V
0  A 2000-01-01 00:00:00  0.752035
1  A 2000-01-01 04:00:00 -1.047444
2  A 2000-01-01 08:00:00  1.177557
3  B 2000-01-01 12:00:00  0.394590
4  B 2000-01-01 16:00:00  1.835067
5  B 2000-01-01 20:00:00 -0.768274
6  C 2000-01-02 00:00:00 -0.564037
7  C 2000-01-02 04:00:00 -2.644367
8  C 2000-01-02 08:00:00 -0.571187
9  C 2000-01-02 12:00:00  1.618557

In [16]: df.TS.astype(object).min()
Out[16]: datetime.datetime(2000, 1, 1, 0, 0)

In [17]: df.TS.astype(object).max()
Out[17]: datetime.datetime(2000, 1, 2, 12, 0)
Chang She
  • 16,692
  • 8
  • 40
  • 25
  • True, but the result of min/max should be boxed as Timestamps. I'll open an issue: http://github.com/pydata/pandas/issues/2083 – Wes McKinney Oct 19 '12 at 14:45
  • Thanks! So I should just bypass Timestamp representation until the issue is resolved? Only in this particular case or can this kind of nuisance be generalized? – LukaszJ Oct 19 '12 at 15:41
  • Timestamp is fine. The problem is the numpy.datetime64 representation. Timestamp is actually a subclass of python datetime and as @Wes_Mckinney mentioned, we'll be putting in a fix to box the numpy.datetime64 return value as a Timestamp so it can be represented correctly. – Chang She Oct 19 '12 at 16:06
  • The problem with datetime64 is more insidious than it looks: In [5]: pd.Timestamp(df.TS.min()) Out[5]: – Wes McKinney Oct 20 '12 at 14:54
  • 3
    Has this been fixed yet? I'm having trouble getting a max timestamp from a series. – Luke Jan 11 '14 at 00:13