
When I access DataFrame.values, all pd.Timestamp objects are converted to np.datetime64 objects. Why? An np.ndarray containing pd.Timestamp objects can exist, so I don't understand why this automatic conversion always happens.

Would you know how to prevent it?

Minimal example:

import numpy as np
import pandas as pd
from datetime import datetime

# Let's declare an array with a datetime.datetime object
values = [datetime.now()]
print(type(values[0]))
> <class 'datetime.datetime'>

# Clearly, the datetime.datetime objects became pd.Timestamp once moved to a pd.DataFrame
df = pd.DataFrame(values, columns=['A'])
print(type(df.iloc[0, 0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>

# Just to be sure, let's convert each value to pd.Timestamp explicitly and assign the result back
df['A'] = df['A'].apply(pd.Timestamp)
print(type(df.iloc[0, 0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>

# df.values (or series.values in this case) returns an np.ndarray
print(type(df.iloc[0].values))
> <class 'numpy.ndarray'>

# When we check the type of the elements of the '.values' array,
# it turns out the pd.Timestamp objects got converted to np.datetime64
print(type(df.iloc[0].values[0]))
> <class 'numpy.datetime64'>


# Just to double check, can an np.ndarray contain pd.Timestamps?
timestamp = pd.Timestamp(datetime.now())
timestamps = np.array([timestamp])
print(type(timestamps))
> <class 'numpy.ndarray'>

# It seems it can. Why the conversion above, then?
print(type(timestamps[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>

python : 3.6.7.final.0

pandas : 0.25.3

numpy : 1.16.4

Voy

2 Answers


I found a workaround: use .array instead of .values (docs).

print(type(df['A'].array[0]))
> <class 'pandas._libs.tslibs.timestamps.Timestamp'>

This prevents the conversion and gives me access to the objects I wanted to use.
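The difference between the two accessors can be sketched in a few lines. This is a minimal illustration, assuming a pandas version that has `.array` (0.24 or later); the fixed date and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd
from datetime import datetime

# Hypothetical one-row frame with a fixed datetime, so the example is reproducible
df = pd.DataFrame({"A": [datetime(2019, 11, 7, 16, 53)]})

# .values returns a NumPy array, so its elements are np.datetime64
from_values = df["A"].values[0]

# .array returns the pandas extension array, so its elements are pd.Timestamp
from_array = df["A"].array[0]

print(type(from_values).__name__)               # datetime64
print(type(from_array).__name__)                # Timestamp
print(pd.Timestamp(from_values) == from_array)  # True
```

Both accessors expose the same instant; only the type of the element handed back differs.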

Voy
  • Thank you so much. I find the `.values` function a bit counter-intuitive: I would expect it to simply access the values in the series and not to convert them to the NumPy representation. `.array` is exactly what I was looking for, but again, to me it is not an intuitive syntax: I would prefer to be able to access the values as a list, e.g. `df['A'][0]`. – rubebop Aug 19 '20 at 01:07
  • For such indexing you can use [`.iat`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iat.html). – gosuto Feb 16 '21 at 18:27

The whole idea behind .values is to:

Return a Numpy representation of the DataFrame. [docs]

I find it logical that a pd.Timestamp is then 'downgraded' to a dtype that is native to numpy. If it didn't do this, what would be the purpose of .values?

If you do want to keep the pd.Timestamp type, I would suggest working with the original Series (df.iloc[0]). I don't see any other way, since according to the source on GitHub .values converts via np.ndarray.
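To make the suggestion concrete, here is a small sketch, assuming a tz-naive datetime column. The column is stored with a NumPy datetime64 dtype, element access through the Series boxes each value into a pd.Timestamp, and a np.datetime64 taken from .values can be wrapped back by hand. The variable names are illustrative only.

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.to_datetime(["2019-11-07 16:53:00"]))

# The Series has a NumPy datetime64 dtype (kind 'M'), not an object dtype
print(s.dtype.kind)  # M

# Element access through pandas boxes the stored value into a pd.Timestamp
boxed = s.iloc[0]

# .values exposes NumPy scalars instead
raw = s.values[0]

# A np.datetime64 can be wrapped back into a pd.Timestamp when needed
restored = pd.Timestamp(raw)

print(type(boxed).__name__)  # Timestamp
print(type(raw).__name__)    # datetime64
print(restored == boxed)     # True
```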

gosuto
  • It appears using `.to_numpy()` instead, as the docs suggest, would create the possibility to coerce a certain `dtype` but this fails for me ("TypeError: data type "pd.Timestamp" not understood") – gosuto Nov 07 '19 at 16:53
  • But then why do they state `The dtype will be a lower-common-denominator dtype`? If the Series contains only one type, why cast it? I guess they really want to stress that it is a *Numpy* representation. Could it be that I'm using the wrong function? How can one access the internal data structure of a DataFrame without doing any casting? Or otherwise, how can one convert a DataFrame into an np.ndarray without casting? That, I believed, was the purpose of `.values`. – Voy Nov 14 '19 at 10:57
  • I don't think it is cast to `np.datetime64` because that is the lower-common-denominator dtype, but because `pd.Timestamp` is not part of `numpy`. A vector with `pandas` functionalities is a `Series`, and converting it to a `numpy` array removes those `pandas` functionalities. – gosuto Nov 14 '19 at 17:26
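Regarding the `.to_numpy()` attempt mentioned in the comments: the `dtype` argument has to be something NumPy understands, and `"pd.Timestamp"` is not a NumPy dtype. Passing `dtype=object` is one option. The sketch below assumes a recent pandas release; the behaviour on 0.25.3 has not been verified here, and the names are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": pd.to_datetime(["2019-11-07 16:53:00"])})

# dtype=object asks for an array of boxed Python objects
# instead of a native datetime64 array
as_objects = df["A"].to_numpy(dtype=object)

print(as_objects.dtype)              # object
print(type(as_objects[0]).__name__)  # Timestamp
```

The result is an ordinary np.ndarray holding pd.Timestamp objects, much like the `np.array([timestamp])` built by hand in the question, at the cost of losing NumPy's vectorised datetime64 operations.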