
I use this code to test CatBoostClassifier.

import numpy as np
from catboost import CatBoostClassifier, Pool

# initialize data
train_data = np.random.randint(0, 100, size=(100, 10))
train_labels = np.random.randint(0, 2, size=(100))
test_data = Pool(train_data, train_labels)  # What is Pool? When should it be used?
# test_data = np.random.randint(0, 100, size=(20, 10))  # Usually we would use a numpy array, not a Pool

model = CatBoostClassifier(iterations=2,
                           depth=2,
                           learning_rate=1,
                           loss_function='Logloss',
                           verbose=True)
# train the model
model.fit(train_data, train_labels)
# make the prediction using the resulting model
preds_class = model.predict(test_data)
preds_proba = model.predict_proba(test_data)
print("class = ", preds_class)
print("proba = ", preds_proba)

The documentation describes Pool like this:

Pool used in CatBoost as a data structure to train model from.

I think we would usually use a numpy array rather than a Pool.

For example:

test_data = np.random.randint(0,100, size=(20, 10))

I could not find any other usage of Pool, so I would like to know: when should we use a Pool instead of a numpy array?

Antony Hatchkins

3 Answers


CatBoost only works with Pools, which are its internal data format. If you pass a numpy array to it, it implicitly converts the array to a Pool first, without telling you. If you need to apply many formulas (models) to one dataset, using a Pool drastically increases performance (something like 10x), because you skip the conversion step each time.


My understanding of a Pool is that it is just a convenience wrapper combining features, labels and further metadata such as categorical features or a baseline.
It does not make a big difference whether you first construct a Pool and then fit your model on it, but it does make a difference when it comes to saving your training data. If you save all the information separately, it might get out of sync or you might forget something, and when loading you need a couple of lines to load everything. The Pool comes in very handy here.
Note that when fitting you can also specify an evaluation dataset as a Pool. If you want to try multiple evaluation datasets, it is quite handy to have each of them wrapped up in a single object. That is what Pools are for.

Paul

The most important thing about CatBoost is that we do not need to encode the categorical features in our dataset ourselves. CatBoost has a built-in one-hot encoding hyperparameter, which can be used only when the cat_features hyperparameter is specified. The cat_features hyperparameter can be hard to define, because an error pops up as soon as we specify an array. Defining it is simpler with a Pool.

Mainak Sen