
Have a project I'm working on and am running into an issue. Essentially I have a set of points scattered across an x/y plot. I have one test point, where I get the target data (y) for the classification (a class label from 1 to 6). I have lots of other points where I have depth-indexed data with some features. The issue with these points is that I don't get a lot of data per point (maybe 100 observations).

I'm using the point closest to the test point to fit the model, then trying to generalize that to the other points that are farther away. It's not giving me great results.

I understand there's not a lot of data to fit to, so I'm trying to improve the model by adding a set of k points close to the test point.

These points all share the same columns, so I've tried stacking them vertically (row-wise), but then my indexes don't line up with the target variable y.

I've also tried concatenating them column-wise, using a suffix denoting the specific point id, but then I get an error about the number of input features (for one point) when I try predicting with the model fitted on the combined features.
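To make this concrete, here's a minimal synthetic reproduction of the column-wise attempt (the feature names and data are made-up stand-ins, since my real data is protected):

```python
import numpy as np
import pandas as pd

# Synthetic stand-ins for my real (protected) data: four nearby points,
# each with ~100 depth-indexed observations of the same three float features.
rng = np.random.default_rng(0)
cols = ["feat_a", "feat_b", "feat_c"]  # hypothetical feature names
X_1, X_2, X_3, X_4 = (
    pd.DataFrame(rng.normal(size=(100, 3)), columns=cols) for _ in range(4)
)
y = pd.Series(rng.integers(1, 7, size=100))  # class labels 1-6 at the test point

# Column-wise concat with per-point suffixes: 12 columns instead of 3.
X_wide = pd.concat(
    [X.add_suffix(f"_{i}") for i, X in enumerate([X_1, X_2, X_3, X_4], start=1)],
    axis=1,
)
print(X_wide.shape)  # (100, 12)

# A model fitted on X_wide expects 12 input features, so predicting on a
# single point's 3-column matrix raises a feature-count mismatch error.
```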

Essentially, what I'm trying to do is the following:

model.fit([X_1,X_2,X_3,X_4],y)

model.predict(X_5)

Where: all features are numeric (floats)

X_1.columns == X_i.columns for every X_i

Each X matrix is about 100 points long with a contiguous integer index [0:100].

I only have one test point (with 100 observations) for each group of points, so it's imperative I use as much data close to the test point as possible.
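In sklearn terms, the closest working pattern I've found is stacking row-wise and repeating y. This continues the synthetic data above; RandomForestClassifier is just a placeholder model, and tiling y assumes every X_i lines up with the test point's 100 depth steps, which I'm not sure holds:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Reusing X_1..X_4, y, cols, rng from the snippet above.
# Drop the per-point indexes so they don't collide, and repeat y
# once per stacked point.
X_train = pd.concat([X_1, X_2, X_3, X_4], ignore_index=True)  # shape (400, 3)
y_train = np.tile(np.asarray(y), 4)                           # shape (400,)

clf = RandomForestClassifier(random_state=0)
clf.fit(X_train, y_train)

# X_5 has the same 3 columns, so predicting on a single point works.
X_5 = pd.DataFrame(rng.normal(size=(100, 3)), columns=cols)
print(clf.predict(X_5)[:10])
```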

Is there another model or technique I can use for this? I've done a bit more research into NN models (I'm not familiar with them, so I'd prefer to avoid them), and found that Keras can take multiple inputs to fit using its functional API, but can I predict with only one input after the model has been fitted to multiple?
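For reference, the multi-input functional-API pattern I found looks roughly like this (a sketch only, not code I'm running):

```python
from tensorflow import keras

n_features = 3  # to match the synthetic example above

# One Input per nearby point; the model is defined over all of them jointly.
inputs = [keras.Input(shape=(n_features,)) for _ in range(4)]
merged = keras.layers.concatenate(inputs)
hidden = keras.layers.Dense(16, activation="relu")(merged)
output = keras.layers.Dense(6, activation="softmax")(hidden)  # 6 classes
model = keras.Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# model.fit / model.predict now expect a list of four arrays, so I can't
# see how to predict from a single X_5 with this setup.
```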

Keras Sequential model with multiple inputs

  • This question is very vague, and doesn't really meet the criteria to be on-topic for this site. Please edit it to provide a [mcve], or post to [datascience.se] instead. That said, based on your description, it seems like you have independently invented the [k-neighbors classifier](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html), so that might be a good avenue to start researching – G. Anderson Oct 22 '20 at 21:59
  • Hi. I've tried to add some examples to it. I know it's very vague; unfortunately I can't provide any of the data as it's protected and not public. – brspencer90 Oct 22 '20 at 22:15

1 Answer


Could you give more information about the features / classes, and the model you're using? It would make things easier to understand.

However, I can give three pointers based on what you've said so far.

  1. To have a better measurement of how well your model is generalizing, you should have more than one test point. See https://en.wikipedia.org/wiki/Training,_validation,_and_test_sets

  2. Sounds like you're using a k-Nearest Neighbors approach. If you aren't already, using the sklearn implementation will save a lot of time, and you can easily experiment with different hyperparameters (see the sketch after this list): https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html

  3. Other techniques: I like to start off with XGBoost or Random Forest, as those methods require little tuning and are reasonably robust. However, there is no magic bullet cure for modeling on a small dataset. The best thing to do would be to collect more data, or if that's impossible, you need to drill down and really understand your data (identify outliers, plot histograms / KDE, etc.).
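As a concrete starting point for (1) and (2), here's a minimal sketch with synthetic data standing in for yours (swap in your real feature matrix and labels):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in: 400 rows, 3 float features, class labels 1-6.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 3))
y = rng.integers(1, 7, size=400)

# Cross-validation (point 1) gives a generalization estimate even when you
# can't spare a held-out test point; n_neighbors (point 2) is the main
# hyperparameter to sweep.
for k in (1, 3, 5, 10):
    knn = KNeighborsClassifier(n_neighbors=k)
    scores = cross_val_score(knn, X, y, cv=5)
    print(f"k={k}: mean CV accuracy {scores.mean():.3f}")
```

RandomForestClassifier from sklearn.ensemble drops into the same cross-validation loop if you want to try the suggestions in (3).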

– Syllabear
  • Hi! Thanks for your feedback. I've edited the post to add some examples; I understand it's vague, but I can't actually release any of the data or code, so hopefully this clears things up. I'm able to fit the model well to one set of data, but it doesn't generalize due to the small data size. I want to artificially increase the data size by adding more sets of data when fitting the model. – brspencer90 Oct 22 '20 at 22:24