0

I have a dataframe like this,

               datetime   id     value
0   2021-02-21 15:43:00  154  0.102677
1   2021-02-21 15:57:00  215  0.843945
2   2021-02-21 00:31:00  126  0.402851
3   2021-02-21 16:38:00   61  0.138945
4   2021-02-21 05:11:00  124  0.865435
..                  ...  ...       ...
115 2021-02-21 21:54:00  166  0.108299
116 2021-02-21 17:39:00  192  0.129267
117 2021-02-21 01:56:00  258  0.300448
118 2021-02-21 20:35:00  401  0.119043
119 2021-02-21 09:16:00  192  0.587173

which I can create by issuing,

import datetime
from numpy import random
#all minutes of the day, ordered, unique
d = pd.date_range("2021-02-21 00:00:00","2021-02-21 23:59:59", freq="1min")

d2 = pd.Series(d).sample(120,replace=True)
ids = random.randint(1,500,size=d2.shape[0])
df = pd.DataFrame({'datetime':d2,'id':ids,'value':random.random(size=d2.shape[0])})
df.reset_index(inplace=True,drop=True)

and I want to have it in a matrix with one index being the minute of the day and the other one being the id, so that I would have 1440*unique(ids).shape[0]

Please, note that, even if some minutes do not appear in the dataframe, the output matrix is 1440 anyways.

I can do it like this,

but this takes VERY long time. How can I better do it?

#all ids, unique
uniqueIds = df.id.unique()
idsN = ids.shape[0]
objectiveMatrix = np.zeros([1440,idsN])
mins = pd.date_range(start='2020-09-22 00:00', end='2020-09-23 00:00', closed=None, freq='1min')
for index, row in df.iterrows():
    a = np.where(row.id==uniqueIds)[0]
    b = np.where(row.datetime==d)[0]
    objectiveMatrix[b,a] = row.value   
myradio
  • 1,703
  • 1
  • 15
  • 25

1 Answers1

1

This is so-called pivot. Pandas has pivot, pivot_table, set_index/unstack for this. For more details, see this excellent guide. As a starter, you can try:

# this extract the time string
df['minute'] = df['datetime'].dt.strftime('%H-%M')

output = df.pivot_table(index='minute', columns='id', values='value')
Quang Hoang
  • 146,074
  • 10
  • 56
  • 74
  • Good, I expected to be a way, but I did not know the term. Really helps. – myradio Feb 21 '21 at 23:07
  • Ok, Now I looked at it better, this still has one difference from what I want. My dataframe might not have data for every minute in the day, but I want the matrix to be 1440 anyways, so, to have one row per minute regardless whether it is filled or not. – myradio Feb 22 '21 at 07:52
  • Ok, got it. I use int minute to make it easier, `df['minute'] = (df['datetime'].dt.hour*60 + df['datetime'].dt.minute).astype(int)`. the I do as you did, `output = df.pivot_table(index='minute', columns='id', values='values')`. And Finally, just reindexed the final 'matrix-shaped' dataframe: `output_full = output.reindex(np.arange(1,1441), fill_value=0)`. – myradio Feb 22 '21 at 08:25