Using OneHotEncoding in Microsoft.ML.AutoML

Question

In my project, I am forced to do some machine learning in C#. Unfortunately, ML.NET is much less intuitive than the comparable libraries in other languages, and I cannot get a RegressionExperiment to run.

First, here are my data classes:

public class DataPoint
{
    [ColumnName("Label")]
    public float y { get; set; }

    [ColumnName("catFeature")]
    public string str { get; set; }

    [ColumnName("smth")]
    public float smth { get; set; }
}


public class MLOutput
{
    [ColumnName("Score")]
    public float score { get; set; }
}

I think my problem lies in the encoding of a categorical variable. For a single model, the code below works fine.

//Create an ML Context
var ctx = new MLContext();

IDataView trainingData = ctx.Data.LoadFromEnumerable(data: data as IEnumerable<DataPoint>);

// Build your data processing and training pipeline
var pipeline = ctx.Transforms.Categorical.OneHotEncoding(outputColumnName: "catFeatureEnc", inputColumnName: "catFeature")
    .Append(ctx.Transforms.Concatenate("Features", new[] {"catFeatureEnc", "smth"}))
    .Append(ctx.Regression.Trainers.FastForest());

// Train your model ????
var trainedModel = pipeline.Fit(trainingData); // shouldn't we transform before fit?????
IDataView transformedData = trainedModel.Transform(trainingData);
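
On the fit-versus-transform ordering asked about in the comments above: the following is a minimal sketch in plain C# (no ML.NET involved) of the general idea behind one-hot encoding. Fitting learns the set of categories from the data, and transforming then applies that learned mapping. The names Fit, Transform, and vocabulary are hypothetical and chosen only for illustration; this is not ML.NET's implementation, and ML.NET's internals may differ.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// "Fit" step (hypothetical helper): learn the vocabulary by giving each
// distinct category an index, in order of first appearance.
Dictionary<string, int> Fit(IEnumerable<string> values)
{
    var map = new Dictionary<string, int>();
    foreach (var v in values)
    {
        if (!map.ContainsKey(v))
            map[v] = map.Count;
    }
    return map;
}

// "Transform" step (hypothetical helper): turn one category into an
// indicator vector of floats using the learned vocabulary.
float[] Transform(Dictionary<string, int> map, string value)
{
    var vector = new float[map.Count];
    if (map.TryGetValue(value, out int index))
        vector[index] = 1f; // categories not seen during Fit stay all-zero
    return vector;
}

// Fitting must come first: it needs the data to discover the categories.
var vocabulary = Fit(new[] { "red", "green", "red", "blue" });

// Only then can a value be transformed with the learned mapping.
float[] encoded = Transform(vocabulary, "green");
Console.WriteLine(string.Join(", ", encoded)); // prints "0, 1, 0"
```

In this sketch, three distinct categories give indices 0 to 2, and each value becomes a vector of three floats. The intermediate index is only loosely comparable to the key type named in an ML.NET error message; treating the two as equivalent is an assumption.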

Now, removing the FastForest model from the pipeline and adding the AutoML code, Microsoft.ML cannot handle the encoding:

// Build your data processing and training pipeline
var pipeline = ctx.Transforms.Categorical.OneHotEncoding(outputColumnName: "catFeatureEnc", inputColumnName: "catFeature")
    .Append(ctx.Transforms.Concatenate("Features", new[] {"catFeatureEnc", "smth"}))
    .Append(ctx.Transforms.Conversion.ConvertType("Features", "Features", DataKind.Single));

// do smth  ???
var trainedModel = pipeline.Fit(trainingData); // nothing there to be fitted???
IDataView transformedData = trainedModel.Transform(trainingData); // shouldn't we transform before fit?????


var experimentSettings = new RegressionExperimentSettings();
experimentSettings.MaxExperimentTimeInSeconds = 60;

// Token that allows the experiment to be cancelled early (the key-press handling itself is not shown here)
var cts = new CancellationTokenSource();
experimentSettings.CancellationToken = cts.Token;

RegressionExperiment experiment = ctx.Auto().CreateRegressionExperiment(experimentSettings);
ExperimentResult<RegressionMetrics> experimentResult = experiment.Execute(transformedData, "Label");

Now, I get the following exception:

Only supported feature column types are Boolean, Single, and String. Please change the feature column catFeatureEnc of type Key<UInt32, 0-2> to one of the supported types.

If I remove catFeatureEnc from the Concatenate call, the code works fine. Alternatively, I tried to create a new pipeline for the training with the transformed data. Unfortunately, this approach doesn't work in the slightest, as the new pipeline expects arbitrary data types for many features.

Another alternative approach:

ExperimentResult<RegressionMetrics> experimentResult = experiment.Execute(trainingData, "Label");

throws the exception:

Training failed with the exception: System.InvalidOperationException: Concatenated columns should have the same type. Column 'smth' has type of Single, but the expected column type is Byte

I don't understand this. Why is a Byte expected?

How can I use the encoded feature with Microsoft.ML.AutoML?

score 0 · Answer 1 · answered Oct 18 '22 at 16:06

It looks like you're using the old version of the API. I would recommend trying the latest version.

To use it, you'll have to add the ML.NET daily feed.

https://pkgs.dev.azure.com/dnceng/public/_packaging/MachineLearning/nuget/v3/index.json

Here are a few samples using the new API:

AutoML with column inference and Auto featurizer

You can also take a look at this other sample, which includes OneHotEncoding: AutoML with data processing pipeline

Luis Quintanilla


Ok, I read one Microsoft Documentation for ML.NET and used the associated API that does not work. So you suggest doing the same with a newer (and preview) version?? Why should the result differ? I guess I have to abandon ML.NET - due to its poor performance at least. Instead, I have read how to use a Python Model in C#. – StephanH Oct 19 '22 at 13:14
"I read one Microsoft Documentation for ML.NET" - We're in the process of updating our docs to the new API. "Why should the result differ" - Because we've made significant improvements to the API. Also because it works. Here is your sample using the new API (https://gist.github.com/luisquintanilla/1a48ee82b9936995bee5a28d4e69d0b6) "So you suggest doing the same with a newer (and preview) version" - Yes because that's the API we'll use going forward when ML.NET 2.0 releases in a few weeks. "due to its poor performance" - I'd be interested in learning more why this is the case. – Luis Quintanilla Oct 20 '22 at 15:26
