I have data of the following shape in a pandas DataFrame, which I want to use with scikit-learn:
temp2004[["station_id"] + hours]
Out[112]:
station_id h1 h2 h3 h4 h5 h6 ... h18 h19 h20 h21 h22 h23 h24
0 1 43 44 42 34 30 35 ... 53 43 37 36 36 43 44
1 1 45 46 47 46 46 45 ... 52 46 45 42 43 44 47
2 1 46 46 42 41 42 41 ... 69 65 64 60 61 62 61
3 1 62 62 60 60 60 60 ... 70 67 67 63 63 63 64
4 1 64 61 62 61 62 62 ... 62 60 60 58 57 53 51
... .. .. .. .. .. .. ... ... ... ... ... ... ... ...
16561 11 29 30 30 30 29 29 ... 30 29 28 27 27 25 22
16562 11 21 20 20 19 19 18 ... 36 33 33 35 35 34 35
16563 11 35 36 36 38 37 37 ... 55 50 50 49 47 47 48
16564 11 46 43 40 37 36 36 ... 51 50 50 47 45 46 44
16565 11 44 45 45 41 40 38 ... 59 54 51 52 51 48 52
[4026 rows x 25 columns]
I need to reshape this so that each station_id becomes a feature (a column), with its hourly values h1 through h24 stacked into a single column of samples beneath it.
I experimented with the pandas stack() method on a single station_id, with the following result:
temp.stack()
Out[72]:
11340 h1 36
h2 32
h3 31
h4 30
h5 34
..
11705 h20 55
h21 55
h22 56
h23 54
h24 53
Length: 8784, dtype: int64
This is exactly what I'm looking for, but I need one such column for each of the other station IDs as well. Is there a good way to do this? Worst case, I think I can create a column for each station separately and combine them.
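For reference, the worst-case approach I have in mind would look roughly like the sketch below. The toy frame and its values are made up for illustration, and the sketch assumes every station has the same number of rows in the same date order, which may not hold for the real data.

```python
import pandas as pd

# Hypothetical toy data: two stations, two days each, three hours per day.
hours = ["h1", "h2", "h3"]
toy = pd.DataFrame({
    "station_id": [1, 1, 11, 11],
    "h1": [43, 45, 29, 21],
    "h2": [44, 46, 30, 20],
    "h3": [42, 47, 30, 20],
})

# Stack each station's hourly values into one long column.
# Dropping the index makes the columns line up by position (assumes
# equal row counts and matching date order across stations).
columns = {}
for station, group in toy.groupby("station_id"):
    columns[station] = group[hours].stack().reset_index(drop=True)

# One column per station, one row per (day, hour) sample.
wide = pd.concat(columns, axis=1)
X = wide.values  # 2-D array of shape (n_samples, n_stations)
print(wide)
```

If the stations do not share the same dates, positional alignment like this would silently pair up the wrong samples, so a date column would have to be kept in the index instead.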
To get the single-station column to work with scikit-learn, I had to do the following:
from sklearn import ensemble, tree
from sklearn.model_selection import train_test_split

# Select the hourly columns for one zone and for its matching station.
load = load2004[load2004.zone_id == zoneNumber][hours]
temp = temp2004[temp2004.station_id == zonetobeststation[zoneNumber]][hours]
# Stack into one long column, then reshape to 2-D: (n_samples, 1).
temp_x = temp.stack().values.reshape(-1, 1)
load_y = load.stack().values.reshape(-1, 1)
temp_train, temp_test, load_train, load_test = train_test_split(temp_x, load_y, test_size=.2, random_state=89986)
dtr = tree.DecisionTreeRegressor(random_state=0)
dtr.fit(temp_train, load_train)
gbr = ensemble.GradientBoostingRegressor(max_depth=10)
# GradientBoostingRegressor expects a 1-D target, hence ravel().
gbr.fit(temp_train, load_train.ravel())
dtr.score(temp_test, load_test)
gbr.score(temp_test, load_test.ravel())
Both stacked results had to be reshaped to 2-D with reshape(-1, 1), and I had to call .ravel() on the target vector for the GradientBoostingRegressor.
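To make the shapes concrete, here is a minimal sketch using a made-up two-row frame (the toy data and variable names are hypothetical, not my real data). It only shows what stack(), reshape(-1, 1) and ravel() do to the array shape.

```python
import pandas as pd

# Hypothetical toy frame standing in for one station's hourly readings.
hours = ["h1", "h2", "h3"]
toy = pd.DataFrame([[43, 44, 42], [45, 46, 47]], columns=hours)

stacked = toy.stack()            # Series indexed by (row, hour)
flat = stacked.values            # 1-D array, shape (6,)
feature = flat.reshape(-1, 1)    # 2-D array, shape (6, 1): a single feature column
target = feature.ravel()         # flattened back to 1-D, shape (6,)

print(flat.shape, feature.shape, target.shape)
```

scikit-learn estimators take the feature matrix as a 2-D array of shape (n_samples, n_features), which is why the single temperature column needs reshape(-1, 1), while the target is normally passed as a 1-D array.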
My worry is that if I blindly make a column out of each station, I won't be able to shape the result correctly for scikit-learn.