
I have data of the following shape in Pandas, to be used with scikit-learn.

temp2004[["station_id"] + hours]
Out[112]: 
       station_id  h1  h2  h3  h4  h5  h6  ...  h18  h19  h20  h21  h22  h23  h24
0               1  43  44  42  34  30  35  ...   53   43   37   36   36   43   44
1               1  45  46  47  46  46  45  ...   52   46   45   42   43   44   47
2               1  46  46  42  41  42  41  ...   69   65   64   60   61   62   61
3               1  62  62  60  60  60  60  ...   70   67   67   63   63   63   64
4               1  64  61  62  61  62  62  ...   62   60   60   58   57   53   51
          ...  ..  ..  ..  ..  ..  ..  ...  ...  ...  ...  ...  ...  ...  ...
16561          11  29  30  30  30  29  29  ...   30   29   28   27   27   25   22
16562          11  21  20  20  19  19  18  ...   36   33   33   35   35   34   35
16563          11  35  36  36  38  37  37  ...   55   50   50   49   47   47   48
16564          11  46  43  40  37  36  36  ...   51   50   50   47   45   46   44
16565          11  44  45  45  41  40  38  ...   59   54   51   52   51   48   52

[4026 rows x 25 columns]

I need to reshape this so that each station_id becomes a feature (one column per station), with that station's h1 to h24 values stacked beneath it as the samples.

I experimented with the pandas stack() method on a single station_id, with the following result:

temp.stack()
Out[72]: 
11340  h1     36
       h2     32
       h3     31
       h4     30
       h5     34
              ..
11705  h20    55
       h21    55
       h22    56
       h23    54
       h24    53
Length: 8784, dtype: int64

This is exactly what I'm looking for, but I need a column like this for each of the other station IDs. Is there a good way to do this? In the worst case, I think I can build a column for each station and combine them.
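
For illustration, here is a minimal sketch of the target layout on toy data. It assumes the rows of each station are in the same day order, so a per-station row counter (the hypothetical `day` column, which is not in the real data) can be used to line the stations up. The frame `toy` and its values are made up and only mimic the shape of temp2004.

```python
import numpy as np
import pandas as pd

hours = [f"h{i}" for i in range(1, 25)]

# Toy stand-in for temp2004: 3 stations x 2 days, random made-up values.
rng = np.random.default_rng(0)
toy = pd.DataFrame(rng.integers(20, 80, size=(6, 24)), columns=hours)
toy.insert(0, "station_id", [1, 1, 2, 2, 3, 3])

# Assumption: the n-th row of every station is the same day.
# "day" is a hypothetical helper column used only to align stations.
toy["day"] = toy.groupby("station_id").cumcount()

wide = (
    toy.set_index(["day", "station_id"])[hours]
       .stack()                # long Series indexed by (day, station_id, hour)
       .unstack("station_id")  # move station_id into the columns
)

print(wide.shape)  # one row per (day, hour), one column per station
```

The result has one row per (day, hour) pair and one column per station. It is worth checking the row order of `wide` before pairing it with stacked load values, since the two must line up sample by sample.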

In order to get that one column to work with scikit-learn, I had to do the following:

from sklearn import ensemble, tree
from sklearn.model_selection import train_test_split

# Select the hourly columns for one zone and for its best-matching station.
load = load2004[load2004.zone_id == zoneNumber][hours]
temp = temp2004[temp2004.station_id == zonetobeststation[zoneNumber]][hours]

# Stack each frame into one long column, then reshape to 2-D (n_samples, 1).
temp_x = temp.stack().values.reshape(-1, 1)
load_y = load.stack().values.reshape(-1, 1)

temp_train, temp_test, load_train, load_test = train_test_split(temp_x, load_y, test_size=.2, random_state=89986)

dtr = tree.DecisionTreeRegressor(random_state=0)
dtr.fit(temp_train, load_train)

# Flatten the target to 1-D for the gradient boosting model.
gbr = ensemble.GradientBoostingRegressor(max_depth=10)
gbr.fit(temp_train, load_train.ravel())

dtr.score(temp_test, load_test)
gbr.score(temp_test, load_test.ravel())

Both stacked results needed to be reshaped, and I needed to use .ravel() on the target vector for the gradient boosting model. My worry is that if I blindly build a column for each station, I won't be able to shape the result correctly for scikit-learn.
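
As a rough check on the shape concern, the sketch below uses made-up data to show the shapes scikit-learn expects: a feature matrix of shape (n_samples, n_features) and a 1-D target. The names `wide_temp` and `load_y` are hypothetical stand-ins, not the real frames. Under these assumptions, a table with one column per station is already 2-D, so reshape(-1, 1) should only be needed in the single-feature case.

```python
import numpy as np
import pandas as pd
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(1)

# Hypothetical wide table: 48 hourly samples, one column per station.
wide_temp = pd.DataFrame(rng.integers(20, 80, size=(48, 3)), columns=[1, 2, 3])
# Hypothetical target: one load value per hourly sample.
load_y = pd.Series(rng.integers(100, 200, size=48))

X = wide_temp.to_numpy()  # already 2-D: (n_samples, n_features)
y = load_y.to_numpy()     # 1-D: (n_samples,)

# A single station is 1-D after selection, so it needs the reshape.
single = wide_temp[1].to_numpy().reshape(-1, 1)

model = DecisionTreeRegressor(random_state=0).fit(X, y)
pred = model.predict(X)

print(X.shape, y.shape, single.shape, pred.shape)
```

Keeping the target 1-D from the start also avoids the separate .ravel() calls.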

Ryan Brothers
  • `print(temp.melt(id_vars=["station_id"]))` is what you're looking for? – Andrej Kesely May 10 '21 at 18:50
  • @AndrejKesely This is similar to what I'm looking for, insofar as I think I can get what I need out of it. It yielded 3 columns, grouped by station_id. What I'm looking for is essentially a table with 11 columns, one for each station ID. – Ryan Brothers May 10 '21 at 18:59
