SystemML Decision Tree - "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10"

Question

I am trying to run a decision tree with the standalone version of SystemML on Windows (https://github.com/apache/incubator-systemml/blob/master/scripts/algorithms/decision-tree.dml), but I keep receiving the error "NUMBER OF SAMPLES AT NODE 1.0 CANNOT BE REDUCED TO MATCH 10. THIS NODE IS DECLARED AS LEAF!". It seems that the code is not computing any split, although I am able to fit a tree via R. Has anyone used this algorithm before and has some tips on how to solve the error? Thank you

score 1 · Accepted Answer · edited Aug 03 '16 at 06:45


This message generally indicates that a split on the best categorical or scale features would not give any additional gain.

I would recommend that you:

- Investigate the computed gain (best_cat_gain, best_scale_gain).
- Double check that the meta data (num_cat_features, num_scale_features) is correctly recognised.

You could simply put additional print statements into the script to do that. In case the meta data is invalid, you might want to check that the optional input R has the right layout as described in the header of the script.

If this does not help, please share the input arguments, format of input data, etc and we'll have a closer look.

edited Aug 03 '16 at 06:45

the_unknown_spirit


answered Aug 03 '16 at 05:47

mboehm7


Thank you very much for your reply. – Elly Aug 03 '16 at 17:56

My data has a binary target (0/1) and 10 variables, all numeric. I have created two .mtx files. One, called Y (n*2), contains just the target in the following format: if a row has target 0, the corresponding row of Y is (0,1); if a row has target 1, the corresponding row of Y is (1,0). The other .mtx file holds the 10 variables, so it is an (n*10) matrix. – Elly Aug 03 '16 at 18:12

I have also created two metadata files, which have the following format: "{ "data_type": "matrix", "value_type": "double", "rows": 150000, "cols": 2, "nnz": 150001, "format": "csv", "header": false, "sep": ",", "description": { "author": "SystemML" } }". The command I run is the following: runStandaloneSystemML.bat scripts/algorithms/decision-tree.dml -nvargs X=C:/Users/Documents/data/X.mtx Y=C:/Users/Documents/data/Y.mtx M=C:/Users/Documents/data/model.csv – Elly Aug 03 '16 at 18:12
Hi, I have checked the metadata and it looks OK to me: num_cat_features=0 and num_scale_features=10. However, there seems to be something wrong with the computed gain, because both the cat_gain and the scale_gain are 0. Are you able to confirm how the matrix Y should be set up? As per the script: "Y =Location to read label matrix Y; note that Y needs to be both recoded and dummy coded". Thanks – Elly Aug 03 '16 at 20:56

Thanks for sharing the input characteristics. This might be an input format mismatch. Please try to create the recoded/dummy-coded Y via SystemML to ensure consistency between data and meta data. You said you created an mtx file (this extension is usually used for matrix market format, i.e., text in ijv format), but the json meta data shows "csv" (instead of "text"). Also, the number of non-zeros (nnz) is supposed to be 150000 instead of 150001. Given your 0/1 label vector Y_orig, you can create Y as follows: `Y = table(seq(1, nrow(Y_orig)), Y_orig+1); write(Y, "test/dtree/Y");` – mboehm7 Aug 04 '16 at 05:24
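
As a rough illustration of what the `table(...)` call in the comment above produces, and why nnz should equal the number of rows, here is a minimal sketch in plain Python rather than DML. The helper names (`dummy_code`, `count_nnz`, `make_metadata`) and the four-element label vector are made up for this example and are not part of SystemML; the metadata fields simply mirror the JSON quoted earlier in the thread.

```python
import json


def dummy_code(labels, num_classes=2):
    """One-hot encode integer labels.

    Label k sets the 0-based column k, which corresponds to column k+1
    in DML's 1-based indexing, i.e. table(seq(1, n), Y_orig + 1).
    """
    rows = []
    for label in labels:
        row = [0] * num_classes
        row[int(label)] = 1
        rows.append(row)
    return rows


def count_nnz(matrix):
    """Count the non-zero cells of a list-of-lists matrix."""
    return sum(1 for row in matrix for value in row if value != 0)


def make_metadata(matrix):
    """Build a metadata dict whose fields mirror the JSON in the thread
    (hypothetical helper, not a SystemML API)."""
    return {
        "data_type": "matrix",
        "value_type": "double",
        "rows": len(matrix),
        "cols": len(matrix[0]),
        "nnz": count_nnz(matrix),
        "format": "csv",
        "header": False,
        "sep": ",",
    }


# Hypothetical 0/1 label vector standing in for Y_orig.
y_orig = [0, 1, 1, 0]
Y = dummy_code(y_orig)
meta = make_metadata(Y)

# Write the matrix rows as CSV lines and show the matching metadata.
csv_lines = [",".join(str(v) for v in row) for row in Y]
print("\n".join(csv_lines))
print(json.dumps(meta))
```

Every row of a dummy-coded label matrix holds exactly one 1, so nnz equals the row count; for a label matrix with 150000 rows that gives 150000, which is the value the metadata should carry.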
Thanks. Do you have some sample data and code you could share? – Elly Aug 04 '16 at 08:18

Hi - it worked! I think your advice was right - I checked the metadata and adjusted the nnz value. I also used .csv rather than .mtx in the end. Thank you very much. I am trying other algorithms, so I will probably reach out if I have further questions. Thanks again! – Elly Aug 04 '16 at 12:21
great - I'm glad to hear that. – mboehm7 Aug 04 '16 at 21:34
