Python: Numpy and Pandas Transforming timestamp/data into one-hot-encoding

Question

I have a column of a dataframe that is like this

              time
0       2017-03-01 15:30:00
1       2017-03-01 16:00:00
2       2017-03-01 16:30:00
3       2017-03-01 17:00:00
4       2017-03-01 17:30:00
5       2017-03-01 18:00:00
6       2017-03-01 18:30:00
7       2017-03-01 19:00:00
8       2017-03-01 19:30:00
9       2017-03-01 20:00:00
10      2017-03-01 20:30:00
11      2017-03-01 21:00:00
12      2017-03-01 21:30:00
13      2017-03-01 22:00:00
.
.
.

I want to "encode" the time of the day. I want to do this by firsly assigning each half an-hour a integer number. Starting from

 00:30:00 --> 1
 01:00:00 --> 2
 01:30:00 --> 3
 02:00:00 --> 4
 02:30:00 --> 5

and so on. Therefore we would have 48 numbers (since there are 24 hours). I would like to find the fastest way of transforming my column into a list/column containing those values.

So far I can do this for one value. For instance

2*int(timeDF.iloc[0][11:13]) + int(int(timeDF.iloc[0][14:16])/30) would transform 15:30:00 into 31.

I think I could do this by doing a loop where instead of using 0 I use an index that loops through the length of the column. However is there a faster way?

one hot encoding

After finding those values, I would use some one-hot-encoder, I think sklearn has one. But the most difficult part is this

stupid solution

labels = []
for date in time:
    labels.append(2*int(date[11:13]) + int(int(date[14:16])/30))

This would contain the values and then one could do something like here

You want to map the values to the dataframe from the list obtained? — Bharath M Shetty, Oct 06 '17 at 13:55
Yes, if I understand your question correctly. The dataframe has one column called `time`. In this column there are dates and times. I only care about the time of the day. I want to use this as a feature in a regression. Therefore I need to map the time to some number. The way I want to do it is such that you take the column, transform/map the time to the corresponding number (e.g `00:30:00` to `1`) and then this number to one-hot-encoder, for instance `[1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0]` — Euler_Salter, Oct 06 '17 at 13:57

jezrael · Accepted Answer · 2017-10-06T14:18:07.587

I think you need map with get_dummies.

Also it seems for first time 0:00 need 0, 0:30 - 1 so using range(48)

#convert to datetimes if necessary
df['time'] = pd.to_datetime(df['time'])

#create dictionary for map
a = dict(zip(pd.date_range('2010-01-01', '2010-01-01 23:59:39', freq='30T').time, range(48)))

#convert time column to times and map by dict
df['a'] = df['time'].dt.time.map(a)
print (df)
                  time   a
0  2017-03-01 15:30:00  31
1  2017-03-01 16:00:00  32
2  2017-03-01 16:30:00  33
3  2017-03-01 17:00:00  34
4  2017-03-01 17:30:00  35
5  2017-03-01 18:00:00  36
6  2017-03-01 18:30:00  37
7  2017-03-01 19:00:00  38
8  2017-03-01 19:30:00  39
9  2017-03-01 20:00:00  40
10 2017-03-01 20:30:00  41
11 2017-03-01 21:00:00  42
12 2017-03-01 21:30:00  43
13 2017-03-01 22:00:00  44

#for one hot encoding use get_dummies
df1 = pd.get_dummies(df['time'].dt.time.map(a))
print (df1)
    31  32  33  34  35  36  37  38  39  40  41  42  43  44
0    1   0   0   0   0   0   0   0   0   0   0   0   0   0
1    0   1   0   0   0   0   0   0   0   0   0   0   0   0
2    0   0   1   0   0   0   0   0   0   0   0   0   0   0
3    0   0   0   1   0   0   0   0   0   0   0   0   0   0
4    0   0   0   0   1   0   0   0   0   0   0   0   0   0
5    0   0   0   0   0   1   0   0   0   0   0   0   0   0
6    0   0   0   0   0   0   1   0   0   0   0   0   0   0
7    0   0   0   0   0   0   0   1   0   0   0   0   0   0
8    0   0   0   0   0   0   0   0   1   0   0   0   0   0
9    0   0   0   0   0   0   0   0   0   1   0   0   0   0
10   0   0   0   0   0   0   0   0   0   0   1   0   0   0
11   0   0   0   0   0   0   0   0   0   0   0   1   0   0
12   0   0   0   0   0   0   0   0   0   0   0   0   1   0
13   0   0   0   0   0   0   0   0   0   0   0   0   0   1

EDIT:

df1 = pd.get_dummies(df['time'].dt.time.map(a)).reindex(columns=range(48), fill_value=0)
    0   1   2   3   4   5   6   7   8   9  ...  38  39  40  41  42  43  44  \
0    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
1    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
2    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
3    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
4    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
5    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
6    0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   0   
7    0   0   0   0   0   0   0   0   0   0 ...   1   0   0   0   0   0   0   
8    0   0   0   0   0   0   0   0   0   0 ...   0   1   0   0   0   0   0   
9    0   0   0   0   0   0   0   0   0   0 ...   0   0   1   0   0   0   0   
10   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   1   0   0   0   
11   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   1   0   0   
12   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   1   0   
13   0   0   0   0   0   0   0   0   0   0 ...   0   0   0   0   0   0   1   

    45  46  47  
0    0   0   0  
1    0   0   0  
2    0   0   0  
3    0   0   0  
4    0   0   0  
5    0   0   0  
6    0   0   0  
7    0   0   0  
8    0   0   0  
9    0   0   0  
10   0   0   0  
11   0   0   0  
12   0   0   0  
13   0   0   0  

[14 rows x 48 columns]

thank you! How would you then do the one hot encoding? So for instance: `[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]` for `00:30:00` i.e. for `1` — Euler_Salter, Oct 06 '17 at 14:08
Wait, I don't understand the last part of your solution.. why is '31' (which corresponds to '15:30:00' encoded to be '[1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]` ? Rather `1` i.e. `00:30:00` should be encoded to that. While `31` should be encoded to `[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]` — Euler_Salter, Oct 06 '17 at 14:12
thank you! Would it change the order of the rows of the column or something like that? — Euler_Salter, Oct 06 '17 at 14:21
basically my desired output would be a dataframe or an array, where each row has exactly one `1` and 47 `0`. The `1` shows what time that row corresponds to. — Euler_Salter, Oct 06 '17 at 14:23
The number of rows would be the number of rows in the original dataframe — Euler_Salter, Oct 06 '17 at 14:24

score 2 · Answer 2 · answered Oct 06 '17 at 14:07

I think this is what you are looking for i.e

x =pd.date_range("00:30", "23:30", freq="30min",format="%HH:%MM").astype(str).str[-8:]
maps = dict(zip(x,np.arange(1,48)))
df['new'] = df['time'].astype(str).str[-8:].map(maps)
pd.get_dummies(df['new']).set_index(df['time'])

Output:

                     31  32  33  34  35  36  37  38  39  40  41  42  43  44
time                                                                       
2017-03-01 15:30:00   1   0   0   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 16:00:00   0   1   0   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 16:30:00   0   0   1   0   0   0   0   0   0   0   0   0   0   0
2017-03-01 17:00:00   0   0   0   1   0   0   0   0   0   0   0   0   0   0
2017-03-01 17:30:00   0   0   0   0   1   0   0   0   0   0   0   0   0   0
2017-03-01 18:00:00   0   0   0   0   0   1   0   0   0   0   0   0   0   0
2017-03-01 18:30:00   0   0   0   0   0   0   1   0   0   0   0   0   0   0
2017-03-01 19:00:00   0   0   0   0   0   0   0   1   0   0   0   0   0   0
2017-03-01 19:30:00   0   0   0   0   0   0   0   0   1   0   0   0   0   0
2017-03-01 20:00:00   0   0   0   0   0   0   0   0   0   1   0   0   0   0
2017-03-01 20:30:00   0   0   0   0   0   0   0   0   0   0   1   0   0   0
2017-03-01 21:00:00   0   0   0   0   0   0   0   0   0   0   0   1   0   0
2017-03-01 21:30:00   0   0   0   0   0   0   0   0   0   0   0   0   1   0
2017-03-01 22:00:00   0   0   0   0   0   0   0   0   0   0   0   0   0   1

but in this way '31' is mapped to the list with '1' in the first entry rather than the '30th' — Euler_Salter, Oct 06 '17 at 14:14
Well this is based on sample data you have given. If you have data starting from 0:00 then it will update automatically — Bharath M Shetty, Oct 06 '17 at 14:15

Python: Numpy and Pandas Transforming timestamp/data into one-hot-encoding

2 Answers2

Linked