9

This behavior seems odd to me: the id column (a string) gets converted to a timedelta when I transpose the df if the other column is a timedelta.

import pandas as pd
df = pd.DataFrame({'id': ['00115', '01222', '32333'],
                   'val': [12, 14, 170]})
df['val'] = pd.to_timedelta(df.val, unit='Minutes')

print(df.T)
#                         0                      1                      2
#id  0 days 00:00:00.000000 0 days 00:00:00.000001 0 days 00:00:00.000032
#val      365 days 05:49:12      426 days 02:47:24     5174 days 06:27:00

type(df.T[0][0])
#pandas._libs.tslib.Timedelta

Without the timedelta it works as I'd expect, and the id column remains a string, even though the other column is an integer and all of the strings could be safely cast to integers.

df2 = pd.DataFrame({'id': ['00115', '01222', '32333'],
                    'val': [1, 1231, 1413]})

type(df2.T[0][0])
#str

Why does the type of id get changed in the first instance, but not the second?

ALollz
  • It looks like a bug. I answered a similar question a while ago; see https://stackoverflow.com/questions/38470550/why-is-a-sum-of-strings-converted-to-floats/38470963#38470963. Fundamentally, the way that pandas deals with data types is a bit messy. – Matt Messersmith Jun 15 '18 at 20:21
  • This seems to happen because `df.id` is of dtype `object`, so `df.T` will have mixed types. pandas then tries to infer which type each column should have, and ends up choosing `timedelta` in this case. If you had dtype `int`, `bytes`, or any other specific type, this wouldn't happen. Why the `dtype` is `object` and not `str` has already been asked and answered here: https://stackoverflow.com/questions/21018654/strings-in-a-dataframe-but-dtype-is-object/21020411, and is honestly quite confusing to me. – rafaelc Jun 15 '18 at 20:24
  • I just thought it was odd that it tried to choose in the timedelta case and not the int case. If one of the `id`s can't be converted to timedelta, they all remain strings. – ALollz Jun 15 '18 at 20:27

2 Answers

6

A dataframe should be thought of in terms of columns. Each column must have a single data type. When you transpose, you change which cells are associated with each other in the new columns. Prior to the transpose, you had a string column and a timedelta column. After the transpose, each column has a string and a timedelta. Pandas has to decide how to cast the new columns. It decided to go with timedelta. It is my opinion that this is a goofy choice.
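To make the mixing concrete, here is a minimal sketch that rebuilds the question's frame and looks at the two cells that end up in the same column after a transpose. It assumes `unit='min'` for convenience (the question's `unit='Minutes'` may not be accepted by every pandas version), so the timedelta values differ from those printed in the question.

```python
import pandas as pd

df = pd.DataFrame({'id': ['00115', '01222', '32333'],
                   'val': [12, 14, 170]})
# Assumption: 'min' is used here instead of the question's 'Minutes'.
df['val'] = pd.to_timedelta(df['val'], unit='min')

# Before the transpose, each column holds a single kind of value.
print(df.dtypes)

# Row 0 of df becomes column 0 of df.T, so these two cells must share a column.
first_id = df['id'].iloc[0]
first_val = df['val'].iloc[0]
print(type(first_id).__name__, type(first_val).__name__)
```

Whatever dtype pandas picks for the transposed column has to accommodate both of those values, which is where the inference step comes in.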

You can change this behavior by changing the dtype on a newly constructed dataframe.

pd.DataFrame(df.values.T, df.columns, df.index, dtype=object)

                     0                  1                   2
id               00115              01222               32333
val  365 days 05:49:12  426 days 02:47:24  5174 days 06:27:00
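As a quick check of the constructor call above, this sketch inspects the cell types in the rebuilt frame. It again assumes `unit='min'` rather than the question's `unit='Minutes'`, and the name `fixed` is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({'id': ['00115', '01222', '32333'],
                   'val': [12, 14, 170]})
# Assumption: 'min' is used here instead of the question's 'Minutes'.
df['val'] = pd.to_timedelta(df['val'], unit='min')

# Transpose the raw values and ask for object columns explicitly,
# so pandas does not try to infer a narrower dtype.
fixed = pd.DataFrame(df.values.T, index=df.columns, columns=df.index,
                     dtype=object)

print(fixed.dtypes)
print(type(fixed.loc['id', 0]))   # the id cells are still plain strings
print(fixed.loc['val', 0])        # the val cells still hold timedeltas
```

Passing `dtype=object` trades away any automatic casting, so every transposed column stays `object` even when its cells could share a narrower type.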
piRSquared
  • Nice answer. `It is my opinion that this is a goofy choice`: Has anyone documented how Pandas decides `dtype` for a series given different constituent types, e.g. what priority logic (if any)? It's probably hidden away in obfuscated code, but this issue comes up quite often. – jpp Jun 15 '18 at 20:36
  • I can show an example of it choosing `object` which I think is appropriate. But other times it casts ints to floats which I also think is appropriate. To answer your question, idk. I certainly haven't. – piRSquared Jun 15 '18 at 20:39
  • It is very messy. If you run `df['id'] = df.id.astype(bytes)` first, then it'd just cast to `object`. Seems so arbitrary – rafaelc Jun 15 '18 at 20:41
  • Technically, it isn't arbitrary. It's written somewhere. It may be as simple as timedelta having a safe casting priority over string /shrug – piRSquared Jun 15 '18 at 20:43
-3

The point of using the to_timedelta method is to "Convert argument to timedelta", per https://pandas.pydata.org/pandas-docs/stable/generated/pandas.to_timedelta.html. This will update the type.

The second time, you never ran the to_timedelta method, so the values are kept in their original state, as object (strings) for the table.

  • I don't think this answers why Pandas chooses to trigger conversion of '00115' to `timedelta` on transposing. – jpp Jun 15 '18 at 20:20