
I was thinking about creating a training set that is as diverse as possible while compressing the data to a smaller size (depending on how similar the data points in the set are). This is to prevent overfitting to relatively unimportant parts of the data. An explanation follows:

The problem is as follows: I am training on car racing data, and a significant part of the road is relatively straight. This part of the data contains little variance and is relatively unimportant: just stay on the road and speed up. The most difficult parts, in my view, are corner anticipation and the speed/angle at which you take a corner.

In order to simplify the problem and focus learning on this part, I would like to select only data points that are significantly different from one another, cutting down the data on straight sections (and on repeated corners of the same type) while keeping data on different types of corners. Each data point is basically a 50-dimensional vector. I want to keep the number of dimensions and only make the density of the data points in this multi-dimensional space more equal, though I do not know a good way to quantify "more equal". So this is essentially a question about pre-processing the data. A rough sketch of the kind of thinning I have in mind is below.
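
To make it concrete, here is a minimal sketch of the kind of selection I am thinking of, assuming plain Euclidean distance on the raw 50-dimensional vectors is meaningful; the function name and the threshold value are just placeholders:

```python
import numpy as np

def thin_by_distance(X, min_dist):
    """Greedily keep a point only if it is at least `min_dist` (Euclidean)
    away from every point already kept. Near-duplicate points, e.g. frames
    from long straight sections, get dropped; unusual corners survive."""
    kept = []
    for i, x in enumerate(X):
        if not kept or np.min(np.linalg.norm(X[kept] - x, axis=1)) >= min_dist:
            kept.append(i)
    return np.array(kept)

# X is the (n_samples, 50) training matrix; min_dist would need tuning.
# idx = thin_by_distance(X, min_dist=0.5)
# X_reduced = X[idx]
```

This keeps the dimensionality and only evens out the point density, but the choice of `min_dist` feels arbitrary, which is part of why I am asking.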

Are there any methods out there that already do this or are there other methods that achieve the same objective?

Joop

1 Answer


If I understood your data set correctly, you need to smooth the vector and then take the most significant deviations of the original vector from the smoothed one. The Savitzky–Golay filter is a common way to smooth data across an array (vector). If you decide to use Python, then scipy.signal.savgol_filter is what you need.
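
A minimal sketch of that idea, using a synthetic stand-in for one channel of your data; the window length, polynomial order, and threshold are illustrative values you would tune:

```python
import numpy as np
from scipy.signal import savgol_filter

# Synthetic stand-in for one channel, e.g. steering angle over time:
# mostly flat (straights) with a short "corner" plus noise.
t = np.linspace(0, 10, 500)
signal = np.where((t > 4) & (t < 5), np.sin(2 * np.pi * (t - 4)), 0.0)
signal += np.random.normal(scale=0.05, size=t.shape)

# Savitzky-Golay smoothing: fit a cubic polynomial in a 21-sample window.
smoothed = savgol_filter(signal, window_length=21, polyorder=3)

# The largest deviations of the raw signal from the smoothed one mark
# the interesting frames (corners) worth keeping in the training set.
residual = np.abs(signal - smoothed)
threshold = residual.mean() + 2 * residual.std()
significant_idx = np.where(residual > threshold)[0]
```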

A good answer related to the topic.

I159