
I want to train a LightGBM model as follows:

train = pd.read_csv(path + "all_train.csv")

# last 8 days for online predictions
# remaining days for the offline train and validation datasets
online_pred = get_windows(train, 541, 548+1)
offline_train = get_windows(train, 0, 541-7)
offline_valid = get_windows(train, 541-7, 541)

cate_feat = train.select_dtypes(include=[np.object]).columns

features = [c for c in train.columns if #(c not in cate_feat) & 
            (c not in ['pv', 'uv'])
           ]

train_x = offline_train[features]
train_y = offline_train['pv']

valid_x = offline_valid[features]
valid_y = offline_valid['pv']

for it in cate_feat:
    train_x.loc[:,it] = LabelEncoder().fit_transform(train_x[it].astype(str))
    valid_x.loc[:,it] = LabelEncoder().fit_transform(valid_x[it].astype(str))

print(train_x.head())
print(valid_x.head())

del offline_train, offline_valid
import gc
gc.collect()
from time import sleep
sleep(30)

for it in cate_feat:
    train_x = train_x.astype('category')

    valid_x = train_y.astype('category')



trn_data = lgb.Dataset(train_x.values, label=train_y.values)
val_data = lgb.Dataset(valid_x.values, label=valid_y.values)

del train_x, train_y, valid_x, valid_y, train
gc.collect()
sleep(30)
pv_predict = np.zeros((online_pred.shape[0], ))

clf = lgb.train(params, trn_data, 10000, valid_sets=[trn_data, val_data], verbose_eval=100, early_stopping_rounds=500, 
                categorical_feature=cate_feat,
                feval=cita_score, evals_result=None
               )

pred = lgb.pred

And here is the train DataFrame:

print(train.head())

   time      event_type  pv   uv   distinct_id          browser  \
0  20181101  $pageview  6549  674 -8539420110265898132     NaN   
1  20181101  $pageview  6549  674 -1032985922238039245  Chrome   
2  20181101  $pageview  6549  674 -1032985922238039245  Chrome   
3  20181101  $pageview  6549  674 -1032985922238039245  Chrome   
4  20181101  $pageview  6549  674 -1046230289121081999     NaN   

  browser_version  is_first_day  is_login lib lib_version       os os_version  \
0             NaN           1.0         0  JS     4.1.0.3      NaN        NaN   
1    70.0.3538.77           1.0         0  JS     4.1.0.3  Windows         10   
2    70.0.3538.77           1.0         0  JS     4.1.0.3  Windows         10   
3    70.0.3538.77           1.0         0  JS     4.1.0.3  Windows         10   
4             NaN           1.0         0  JS     4.1.0.3      NaN        NaN   

  platform  screen_height  screen_width     title  \
0       JS         1024.0        1024.0     demo   
1       JS         1080.0        1920.0     register   
2       JS         1080.0        1920.0     demo   
3       JS         1080.0        1920.0     register      
4       JS         1600.0        1600.0     private deploy   

                                                 url      country  province city  \
0  https://ark.analysys.cn/portal/industry-demo.html      China       PK    PK   
1  https://ark.analysys.cn/view/sign/signup.html?...      China       PK    PK   
2  https://ark.analysys.cn/portal/industry-demo.html      China       SH    SH   
3  https://ark.analysys.cn/view/sign/signup.html?...      China       SH    SH   
4  https://ark.analysys.cn/portal/access-private....      China       PK    PK   

                                            referrer     is_first_time model  \
0                                                NaN            NaN    NaN   
1  https://ark.analysys.cn/?utm_campaign=%E6%96%B...            NaN    NaN   
2  https://ark.analysys.cn/?utm_campaign=%E6%96%B...            NaN    NaN   
3  https://ark.analysys.cn/portal/industry-demo.html            NaN    NaN   
4                                                NaN            NaN    NaN   

  brand utm_campaign utm_content utm_medium utm_source utm_term  \
0   NaN          NaN         NaN        NaN        NaN      NaN   
1   NaN          NaN         NaN        NaN        NaN      NaN   
2   NaN          NaN         NaN        NaN        NaN      NaN   
3   NaN          NaN         NaN        NaN        NaN      NaN   
4   NaN          NaN         NaN        NaN        NaN      NaN   

  utm_campaign_id  startup_time time_zone  web_crawler traffic_source_type  \
0             NaN  1.540976e+12       NaN          NaN                 NaN   
1             NaN  1.541053e+12       NaN          NaN                 NaN   
2             NaN  1.541053e+12       NaN          NaN                 NaN   
3             NaN  1.541053e+12       NaN          NaN                 NaN   
4             NaN  1.541018e+12       NaN          NaN                 NaN   

   search_engine social_share_from referrer_domain  social  scene  \
0            NaN               NaN             NaN     NaN    NaN   
1            NaN               NaN             NaN     NaN    NaN   
2            NaN               NaN             NaN     NaN    NaN   
3            NaN               NaN             NaN     NaN    NaN   
4            NaN               NaN             NaN     NaN    NaN   

  search_keyword  scene_type  channel language session_id  social_media  \
0            NaN         NaN      NaN      NaN        NaN           NaN   
1            NaN         NaN      NaN      NaN        NaN           NaN   
2            NaN         NaN      NaN      NaN        NaN           NaN   
3            NaN         NaN      NaN      NaN        NaN           NaN   
4            NaN         NaN      NaN      NaN        NaN           NaN   

   signup_time url_domain  is_time_calibrated  click_x  click_y device_type  \
0          NaN        NaN                 NaN      NaN      NaN         NaN   
1          NaN        NaN                 NaN      NaN      NaN         NaN   
2          NaN        NaN                 NaN      NaN      NaN         NaN   
3          NaN        NaN                 NaN      NaN      NaN         NaN   
4          NaN        NaN                 NaN      NaN      NaN         NaN   

   element_path  page_height  page_width  event_duration  viewport_height  \
0           NaN          NaN         NaN             NaN              NaN   
1           NaN          NaN         NaN             NaN              NaN   
2           NaN          NaN         NaN             NaN              NaN   
3           NaN          NaN         NaN             NaN              NaN   
4           NaN          NaN         NaN             NaN              NaN   

   viewport_position  viewport_width  campaign_shortlink  pagename  nav_name  \
0                NaN             NaN                 NaN       NaN       NaN   
1                NaN             NaN                 NaN       NaN       NaN   
2                NaN             NaN                 NaN       NaN       NaN   
3                NaN             NaN                 NaN       NaN       NaN   
4                NaN             NaN                 NaN       NaN       NaN   

   referrer_demo  board_name  click_position datafrom  day  
0            NaN         NaN             NaN      NaN    0  
1            NaN         NaN             NaN      NaN    0  
2            NaN         NaN             NaN      NaN    0  
3            NaN         NaN             NaN      NaN    0  
4            NaN         NaN             NaN      NaN    0  

When I call lgb.train to train the model, I get this error:

     41 clf = lgb.train(params, trn_data, 10000, valid_sets=[trn_data, val_data], verbose_eval=100, early_stopping_rounds=500,
     42                 categorical_feature=cate_feat,
---> 43                 feval=cita_score, evals_result=None
     44                )
     45

~/.local/lib/python3.5/site-packages/lightgbm/engine.py in train(params, train_set, num_boost_round, valid_sets, valid_names, fobj, feval, init_model, feature_name, categorical_feature, early_stopping_rounds, evals_result, verbose_eval, learning_rates, keep_training_booster, callbacks)
    140         ._set_predictor(predictor) \
    141         .set_feature_name(feature_name) \
--> 142         .set_categorical_feature(categorical_feature)
    143
    144     is_valid_contain_train = False

~/.local/lib/python3.5/site-packages/lightgbm/basic.py in set_categorical_feature(self, categorical_feature)
   1196             Dataset with set categorical features.
   1197         """
-> 1198         if self.categorical_feature == categorical_feature:
   1199             return self
   1200         if self.data is not None:

ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()

I searched related questions and found that some people get this error because of numpy.ndarray comparisons, like here.
But I can't see a similar problem in my code.
Could anyone help me?
Thanks in advance.

Bowen Peng

3 Answers


Your problem is here:

cate_feat = train.select_dtypes(include=[np.object]).columns

This makes cate_feat a pandas Index (an array-like backed by a NumPy array), whereas categorical_feature expects a list according to the docs. The problem is that an array compares element-wise:

np.array([1,2]) == np.array([1,2])
array([ True,  True])

The if statement can't evaluate that result, because it needs a single boolean.

Lists, by contrast, compare as a whole:

[1,2] == [1,2]
True

So the if statement works.

The solution should be to turn that initial line into

cate_feat = list(train.select_dtypes(include=[np.object]).columns)

I think everything else should work after that.
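A minimal sketch of why the conversion matters, using a small made-up DataFrame in place of train. The helper same_setting is hypothetical and only mimics the `if self.categorical_feature == categorical_feature:` check shown in the traceback; it is not LightGBM code. The dtype is selected with the string "object" because the np.object alias was removed in NumPy 1.24.

```python
import pandas as pd

# Made-up stand-in for the question's `train` frame (assumption).
train = pd.DataFrame({
    "browser": pd.Series(["Chrome", "Firefox", "Chrome"], dtype=object),
    "os": pd.Series(["Windows", "Linux", "Windows"], dtype=object),
    "pv": [6549, 6549, 6549],
})


def same_setting(current, new):
    """Hypothetical helper mimicking the `if a == b:` check from the traceback."""
    if current == new:
        return True
    return False


# .columns returns a pandas Index, so == compares element by element.
cate_index = train.select_dtypes(include=["object"]).columns
elementwise = cate_index == cate_index
print(type(cate_index).__name__, elementwise)

# Comparing the Index with a scalar also yields an array, which `if` rejects.
try:
    same_setting("auto", cate_index)
    raised = False
except ValueError as exc:
    raised = True
    print("ValueError:", exc)

# A plain list compares as a whole and yields a single bool.
cate_feat = list(cate_index)
print(cate_feat, same_setting("auto", cate_feat))
```

The string "auto" stands in for whatever value the Dataset currently holds; any non-array value would trigger the same element-wise comparison against an Index.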

Daniel F

First, I couldn't understand the logic of writing this twice.

for it in cate_feat:
    train_x = train_x.astype('category')

    valid_x = train_y.astype('category')


for it in cate_feat:
    train_x.loc[:,it] = LabelEncoder().fit_transform(train_x[it].astype(str))
    valid_x.loc[:,it] = LabelEncoder().fit_transform(valid_x[it].astype(str))

You aren't even using the loop variable anywhere in the first loop.

Second, this reference might help. Read the accepted answer and try to apply it to your problem.
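One possible cleanup of the two loops, sketched on made-up data. It assumes the intent was to give each categorical column integer codes that mean the same thing in the training and validation frames. The frames, the column names, and the shared CategoricalDtype are illustrative assumptions, not the asker's code. In the quoted loops, valid_x is overwritten with train_y, and fitting a separate LabelEncoder on train_x and on valid_x can assign different codes to the same value.

```python
import pandas as pd

# Made-up frames standing in for train_x and valid_x (assumption).
train_x = pd.DataFrame({
    "browser": pd.Series(["Chrome", "Firefox", None], dtype=object),
    "screen_width": [1920.0, 1024.0, 1600.0],
})
valid_x = pd.DataFrame({
    "browser": pd.Series(["Firefox", "Safari"], dtype=object),
    "screen_width": [1920.0, 1600.0],
})
cate_feat = ["browser"]

for it in cate_feat:
    # Build one category list shared by both frames, so a given value
    # maps to the same integer code in train_x and valid_x.
    values = pd.concat([train_x[it], valid_x[it]]).dropna().unique()
    shared = pd.CategoricalDtype(categories=sorted(values))
    # Convert only the current column; numeric columns stay untouched.
    train_x[it] = train_x[it].astype(shared)
    valid_x[it] = valid_x[it].astype(shared)

print(train_x.dtypes)
print(train_x["browser"].cat.codes.tolist())  # missing values get code -1
print(valid_x["browser"].cat.codes.tolist())
```

A single pass like this replaces both loops. Whether validation-only values such as "Safari" should be added to the category list or treated as missing is a modelling choice that the sketch does not settle.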

Amit Gupta

The following line is what causes the error to be raised.

~/.local/lib/python3.5/site-packages/lightgbm/basic.py

1200 |    if self.data is not None:

See the reference here. The lightgbm.train() function expects its train_set parameter to be a lightgbm.Dataset, for which you can find the documentation here.

So you need to create a lightgbm.Dataset object for your datasets. You can do it as follows:

import lightgbm as lgb
...
lgb_train = lgb.Dataset(data=X_train, label=y_train)

Then, you can train your model without any errors.

Community
null