I would like to create a train_test_split function that splits a matrix (a vector of vectors) of data into two other matrices, similar to what sklearn's function does. This is my attempt at doing so:

#include <iostream> 
#include <cstdlib>
#include <fstream> 
#include <time.h>
#include <vector>  
#include <string> 

using namespace std;

vector<vector<float>> train_test_split(vector<vector<float>> df, float train_size = 0.8){
  vector<vector<float>> train; 
  vector<vector<float>> test; 
  srand(time(NULL)); 
  for(int i = 0; i < df.size(); i++){
    int x = rand() % 10 + 1; 
    if(x <= train_size * 10){
      train.push_back(df[i]);
    } 
    else{
      test.push_back(df[i]);
    }
  }
  return train, test;
} 

int main(){
   vector<vector<float>> train;
   vector<vector<float>> test; 
   vector<vector<float>> df = {{1,2,3,4}, 
                               {5,6,7,8},
                               {9,10,11,12}};

   train, test = train_test_split(df); 
   cout << "training size: " << train.size() << ", test size: " << test.size() << endl; 
   return 0; 
}

This approach only puts data into the test matrix. After some research, I have discovered that a C++ function cannot return two values. I am very new to C++, and I am wondering what the best way to approach this would be. Any help will be appreciated.

Olive Yew

1 Answer

A function can only return one value. But look at your function declaration: it is declared to return a vector<vector<float>>, and that's a container of many vector<float>s. Containers can contain many elements (of the same type), and custom types can contain many members, so the function can return a struct that holds both matrices:

struct train_test_split_result {
    vector<vector<float>> train;
    vector<vector<float>> test;
};

// train_size is a fraction, so it has to be a float (an int would truncate 0.8 to 0)
train_test_split_result train_test_split(vector<vector<float>> df, float train_size = 0.8) {
    train_test_split_result result;
    // ...
    // result.train.push_back(...)
    // result.test.push_back(...)
    // ...
    return result;
}

int main(){
   vector<vector<float>> df = {{1,2,3,4}, 
                               {5,6,7,8},
                               {9,10,11,12}};

   train_test_split_result result = train_test_split(df); 
   cout << "training size: " << result.train.size() << ", test size: " << result.test.size() << endl; 
}
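For completeness, here is a minimal sketch of how the elided body could be filled in. It is not the only way to do it: the `Matrix` alias, the `seed` parameter, passing `df` by const reference, and the use of `<random>` instead of `rand()` are assumptions made for this example. As in the question, each row is assigned independently, so the split is only approximately `train_size` for small inputs.

```cpp
#include <random>
#include <vector>

// Hypothetical alias, used only to keep the signatures short.
using Matrix = std::vector<std::vector<float>>;

// One return value that carries both halves of the split.
struct train_test_split_result {
    Matrix train;
    Matrix test;
};

// Puts each row into train with probability train_size, otherwise into test.
// The seed parameter is an assumption added here so runs are reproducible.
train_test_split_result train_test_split(const Matrix& df,
                                         float train_size = 0.8f,
                                         unsigned seed = 42) {
    train_test_split_result result;
    std::mt19937 gen(seed);
    std::bernoulli_distribution to_train(train_size);
    for (const auto& row : df) {
        if (to_train(gen)) {
            result.train.push_back(row);
        } else {
            result.test.push_back(row);
        }
    }
    return result;
}
```

Calling `train_test_split(df)` then gives a result whose `train` and `test` members together hold every row of `df` exactly once.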

PS: You should turn up your compiler's warnings and read them! Then read this: How does the Comma Operator work
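To make the comma operator point concrete, here is a small sketch with ints instead of vectors. The function names are made up for this example. Both functions are valid C++, which is why the question's code compiled, and with warnings enabled a compiler will typically flag the discarded left operand.

```cpp
// Mirrors `return train, test;` from the question: the comma operator
// evaluates `a`, discards it, and yields `b`, so only `b` is returned.
int second_of(int a, int b) {
    return a, b;
}

// Mirrors `train, test = train_test_split(df);`: assignment binds tighter
// than the comma, so this parses as `first, (second = value);` and `first`
// is never modified.
void assign_with_comma(int& first, int& second, int value) {
    first, second = value;
}
```

This is the same reason only the test matrix received data in the question: `train` was evaluated and discarded in both places.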

PPS: A nested vector is a terrible data structure for a matrix. std::vector benefits a lot from memory locality, but because each inner vector allocates its elements dynamically, the floats in a std::vector<std::vector<float>> are scattered around in memory. If the size is known at compile time and small enough that it does not require dynamic allocation, you can use a nested array. Alternatively, use a flat std::vector<float> to store the matrix.
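A rough sketch of the flat layout, assuming row-major storage. The `FlatMatrix` type and its `at` accessor are hypothetical names for this example, not a standard facility. Both dimensions are chosen at run time, so this also works when the shape comes from a file.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical row-major matrix: all floats live in one contiguous buffer.
struct FlatMatrix {
    std::size_t rows;
    std::size_t cols;
    std::vector<float> data;  // always rows * cols elements

    FlatMatrix(std::size_t r, std::size_t c)
        : rows(r), cols(c), data(r * c, 0.0f) {}

    // Element (i, j) is stored at offset i * cols + j.
    float& at(std::size_t i, std::size_t j) {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }

    const float& at(std::size_t i, std::size_t j) const {
        assert(i < rows && j < cols);
        return data[i * cols + j];
    }
};
```

With this layout, a whole row is a contiguous run of `cols` floats starting at `data[i * cols]`, so copying a row into a train or test matrix is a single range copy.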

PPPS: There are also "out parameters": the function takes arguments by non-const reference, the caller passes in the objects, and the function modifies them. Generally, though, out-parameters are not recommended.
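For comparison, an out-parameter version might look like the sketch below. The name `train_test_split_out`, the `Matrix` alias, and the `seed` parameter are assumptions for this example. Note that the caller has to declare both matrices first, and nothing at the call site shows that they get overwritten, which is part of why returning a struct is usually preferred.

```cpp
#include <random>
#include <vector>

// Hypothetical alias, used only to keep the signature short.
using Matrix = std::vector<std::vector<float>>;

// Writes the split into the caller's train and test matrices.
// Any previous contents of train and test are discarded.
void train_test_split_out(const Matrix& df, Matrix& train, Matrix& test,
                          float train_size = 0.8f, unsigned seed = 42) {
    train.clear();
    test.clear();
    std::mt19937 gen(seed);
    std::bernoulli_distribution to_train(train_size);
    for (const auto& row : df) {
        if (to_train(gen)) {
            train.push_back(row);
        } else {
            test.push_back(row);
        }
    }
}
```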

463035818_is_not_an_ai
  • Wow, thanks for the detailed answer, lots to unpack here. I made a silly mistake: as input to the function I'm using int instead of float, thus rounding the train size to 1. Could you update your solution please? I'll do the same. Unfortunately, the size of the matrix is not known at compile time, since the data are coming from a csv file. Also, could you elaborate briefly on how auto works in this case? – Olive Yew Sep 22 '22 at 13:13
  • @OliveYew `int` or `float` doesn't really matter for the answer. Do you know the number of columns in the csv in advance? A `std::vector<std::array<float, N>>` has memory locality, i.e. you only need to fix one dimension. `auto` isn't essential for the answer, I'll remove it – 463035818_is_not_an_ai Sep 22 '22 at 13:16
  • I see, the number of columns is not known in advance since I'm writing some generic functions that I will utilize in the future. I can however write a function that counts the number of columns, and then input the data in an array. Thanks for the tip! – Olive Yew Sep 22 '22 at 13:19