0

I originally posted a question about plotting different datetime-sampling in the same plot, stored in many different dataframes.

I got help understanding I needed to convert my time-column (‘ts’) to datetime. I struggled with this, still getting messed up plots. Turns out my conversion to datetime isn’t working, and this is a known thing, as stated here.

A dataframe can’t store datetime in a column (why??), it converts it back to pandas._libs.tslibs.timestamps.Timestamp.

I need to figure out the best work around this to be able to plot large datasets.

In the post above, it is stated that dataframe index can store datetime format, but when I set my column as index, and try to loop through, I get key error.

 In[]: df.index.name 
 Out[]: ‘ts’

but when I try:

for column in df.columns[1:]:
    df['ts'] = pd.to_datetime(df['ts'])

I get KeyError: 'ts'

Am I doing something wrong here? Does anyone know if datetime is stored correctly in the index?

However, I would still like to ask about the best work-around for this issue.

My bottom line is wanting to plot several dataframes correctly in the same plot. I have a lot of large datasets, and when trying out things, I am using two simplified dataframes, see below:

print(df1)
                        ts  value
0  2019-10-18 08:13:26.702     14
1  2019-10-18 08:13:26.765     10
2  2019-10-18 08:13:26.790      5
3  2019-10-18 08:13:26.889      6
4  2019-10-18 08:13:26.901      8
5  2019-10-18 08:13:27.083     33
6  2019-10-18 08:13:27.098     21
7  2019-10-18 08:13:27.101     11
8  2019-10-18 08:13:27.129     22
9  2019-10-18 08:13:27.159     29
10 2019-10-18 08:13:27.188      7
11 2019-10-18 08:13:27.212     20
12 2019-10-18 08:13:27.228     24
13 2019-10-18 08:13:27.246     30
14 2019-10-18 08:13:27.395     34
15 2019-10-18 08:23:26.375     40
16 2019-10-18 08:23:26.527     49
17 2019-10-18 08:23:26.725     48

print(df2)
                       ts  value
0 2019-10-18 08:23:26.375     27
1 2019-10-18 08:23:26.427     17
2 2019-10-18 08:23:26.437      4
3 2019-10-18 08:23:26.444      2
4 2019-10-18 08:23:26.527     39
5 2019-10-18 08:23:26.575     25
6 2019-10-18 08:23:26.662      6
7 2019-10-18 08:23:26.676     14
8 2019-10-18 08:23:26.718     11
9 2019-10-18 08:23:26.725     13

What is the best way to achieve the result I am looking for?

I have tried converting ‘ts’ column to both array and list, but nothing seem to bring me closer to a final working result for plotting the datasets together. Converting to datetime in array gives me numpy.datetime64, converting to datetime in list gives me pandas._libs.tslibs.timestamps.Timestamp.

Any help is highly appreciated as this is really driving me crazy.

If needed, my original 'ts' values read from avro files are of type:

 '2019-10-18T08:13:27.098000'

Running:

df['ts'] = pd.to_datetime(df['ts'])

returns

'2019-10-18 08:13:27.098'  (pandas._libs.tslibs.timestamps.Timestamp)

EDIT 1

Further information about my steps, this is my df after reading the avro files:

enter image description here

This is my df after first attempt to turn the format into datetime, returns timestamp:

enter image description here

This is what my df looks like after setting 'ts' as index:

enter image description here

I then try to turn the timestamp to datetime when it's in the index, I get keyError:

enter image description here

Miss.Pepper
  • 103
  • 10
  • When you set the index to the 'ts' column are you leaving the ts column in the data frame or are you dropping the ts column? What does your dataframe look like after setting the index to the ts column? – itprorh66 Oct 16 '20 at 14:25
  • I edited the post, adding some pictures. I have no clue how to solve the overall task here. If datetime in index works, and I can get the conversion working for the index column, then I think my problem is solved. If not, I still need a work around. Can't understand why datetime can't be stored in a df column. – Miss.Pepper Oct 16 '20 at 18:32
  • Maybe there is a misunderstanding here: an individual element of a pd.Series of dtype datetime64 will be of class pd._libs.tslibs.timestamps.Timestamp. Also, if you set a column as index by `df = df.set_index('ts')`, `'ts'` is now the index and you can't call it as a column. So that is *expected* behavior. If you want to keep it also as a column, use `df = df.set_index('ts', drop=False)` – FObersteiner Oct 16 '20 at 19:39
  • Please use [`DataFrame.dtypes`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dtypes.html) not `type` at a specific data point with `.at`. You will see your `ts` column is `datetime64[ns]`. Also, remove unneeded loop. – Parfait Oct 16 '20 at 21:17
  • dtypes gives med 'ts' is datetime64[ns], I am not closer with regards to plotting based on this data. What am I doing wrong? Plotting still returns a figure like the last picture in this post (https://stackoverflow.com/questions/64065968/plot-dataframes-with-different-datetime-sampling-in-the-same-plot?noredirect=1#comment113871079_64065968) I read something about datetime64 not being a 'plotable' format, do you know if this is true? How to work around this? – Miss.Pepper Oct 21 '20 at 07:58

1 Answers1

0

I guess I am having trouble figuring out what you are asking. Given a df of the form:

    ts  value
0   2019-10-18 08:13:26.702 14
1   2019-10-18 08:13:26.765 10
2   2019-10-18 08:13:26.790 5
3   2019-10-18 08:13:26.889 6
4   2019-10-18 08:13:26.901 8
5   2019-10-18 08:13:27.083 33

I can execute the following to convert the ts column to a pd.datetime varaible and make the ts column the index:

df['ts'] = pd.to_datetime(df['ts'])
df = df.set_index(['ts'], drop=True)

which yields the df of form

                       value
       ts   
2019-10-18 08:13:26.702 14
2019-10-18 08:13:26.765 10
2019-10-18 08:13:26.790 5
2019-10-18 08:13:26.889 6
2019-10-18 08:13:26.901 8

I can then print the values of the index, or for that matter use any iteration on the index I want. The following just gives the first 5 values.

for i in range(5):
    print(df.iloc[i].name)

2019-10-18 08:13:26.702000
2019-10-18 08:13:26.765000
2019-10-18 08:13:26.790000
2019-10-18 08:13:26.889000
2019-10-18 08:13:26.901000
itprorh66
  • 3,110
  • 4
  • 9
  • 21
  • I agree with all of this. And dtypes shows that 'ts' column is datetime64[ns], but plotting does not work! Can datetime64 be plotted directly? If yes, how? If I keep 'ts' as a column, or if I make 'ts' the index, it returns the same wrong plot. Shown in this post (https://stackoverflow.com/questions/64065968/plot-dataframes-with-different-datetime-sampling-in-the-same-plot?noredirect=1#comment113871079_64065968) – Miss.Pepper Oct 21 '20 at 07:53
  • Can you explain to me what is wrong with the plot? I see a plot with an x axis showing time and a y axis showing values. What part of this plot is wrong? – itprorh66 Oct 21 '20 at 12:30
  • If you look at the data in the dataframe, shown in the picture above, df1, you will see nothing is correct about the plot. Time on x-axis is 'made up' seconds, it is not looking at my actual timestamps. The values on the y-axis is not correct either, but this may be due to the fact that time/date isn't interpreted correct? – Miss.Pepper Oct 21 '20 at 13:41