Splitting training and testing data

Question

I have a dataset of around 15,500 rows. The dataset consists of two columns: a text column (independent variable) and an output column (dependent variable). Output has binary values (0 and 1). Around 9,500 rows have a value in the output column (so I can use them for training), and I want to use the remaining 6,000 rows, which have no output value, for testing. All 15,500 rows are in one single file. I created a model definition file in which I used the parallel_CNN encoder for the text column. I used the following command to train and test on the dataset:

ludwig experiment --dataset dataset_name.csv --config_file model_definitions.yml

The problem is that I have not told the program to use the first 9,500 rows for training and the remaining rows for testing. Is there any way in Ludwig to pass an argument specifying which rows should be used for training and which for testing? Or is there a better way to do the same task?

Have you tried `--training_set` and `--test_set` [arguments](https://ludwig-ai.github.io/ludwig-docs/user_guide/#experiment)? — bartolo-otrit, Feb 28 '21 at 13:35
@bartolo-otrit I tried that, but it did not work. The accuracy of the model (when using --training_set and --test_set) was 0.0 — user2293224, Mar 01 '21 at 08:04
_it did not work_. Have you split your single file in two, so that the first contains only training data? Have you tried to train the model with training data in that file without the test data? — bartolo-otrit, Mar 01 '21 at 10:20
@bartolo-otrit Thanks for the suggestion. I followed your advice and trained the model on the training data only. Then I used the trained model to predict on the test data, so I now have the predicted values. However, I am not sure how to calculate the accuracy of the predictions: in the test data, the dependent variable is blank in every row. Any suggestions? — user2293224, Mar 03 '21 at 06:54
In supervised learning (as far as I know) ground truth values are necessary for accuracy evaluation. If your data isn't labeled, how will you find out if a prediction is correct? — bartolo-otrit, Mar 03 '21 at 10:16
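
Following up on the comments above: since accuracy needs ground truth, one option is to hold out part of the 9,500 labeled rows as a test set and keep the 6,000 unlabeled rows for prediction only. Below is a minimal sketch of that idea using only the Python standard library. The column name `output`, the 20% holdout fraction, and the output file names are assumptions for illustration, not something Ludwig requires.

```python
import csv
import random


def split_rows(rows, label_col="output", test_fraction=0.2, seed=42):
    """Return (train, test, unlabeled) lists of row dicts.

    Rows with a blank label go to `unlabeled`; the labeled rows are
    shuffled and a fraction of them is held out as `test`.
    """
    labeled, unlabeled = [], []
    for row in rows:
        value = (row.get(label_col) or "").strip()
        (labeled if value else unlabeled).append(row)

    shuffled = labeled[:]
    random.Random(seed).shuffle(shuffled)  # seeded for reproducibility
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test], unlabeled


def write_csv(path, rows, fieldnames):
    """Write row dicts to a CSV file with a header line."""
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)


def split_file(path, label_col="output", test_fraction=0.2, seed=42):
    """Split one CSV into train.csv, test.csv and unlabeled.csv (assumed names)."""
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames
        rows = list(reader)
    train, test, unlabeled = split_rows(rows, label_col, test_fraction, seed)
    write_csv("train.csv", train, fieldnames)
    write_csv("test.csv", test, fieldnames)
    write_csv("unlabeled.csv", unlabeled, fieldnames)
    return len(train), len(test), len(unlabeled)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the ground-truth labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Example (not run here): split_file("dataset_name.csv")
```

With the files written, the flags mentioned in the first comment could then be pointed at them, for example `ludwig experiment --training_set train.csv --test_set test.csv --config_file model_definitions.yml`. The accuracy reported on `test.csv` is meaningful because those rows have labels, and the rows in `unlabeled.csv` can only be predicted, not scored.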
