Feature engineering is the process of using domain knowledge of the data to create features that make machine learning algorithms work.
Questions tagged [feature-engineering]
481 questions
23
votes
3 answers
Categorical features correlation
I have some categorical features in my data along with continuous ones. Is it a good or an absolutely bad idea to one-hot encode the categorical features to find their correlation with the labels, along with the other continuous features?

user8653080
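A minimal sketch of what the question above describes: one-hot encode the categorical column, then compute the Pearson correlation of every resulting column with the label. The toy data and the column names `color`, `size`, and `label` are hypothetical, and this only illustrates the mechanics, not whether the approach is statistically advisable.

```python
import pandas as pd

# Hypothetical toy data: one categorical feature, one continuous feature, a binary label.
df = pd.DataFrame({
    "color": ["red", "blue", "red", "green", "blue", "red"],
    "size": [1.0, 2.5, 1.2, 3.1, 2.7, 0.9],
    "label": [1, 0, 1, 0, 0, 1],
})

# One-hot encode the categorical column; the continuous column passes through unchanged.
encoded = pd.get_dummies(df, columns=["color"], dtype=float)

# Pearson correlation of each feature column with the label.
corr_with_label = encoded.drop(columns="label").corrwith(encoded["label"])
print(corr_with_label.sort_values())
```

Each one-hot column is binary, so its Pearson correlation with a binary label measures association for that single category only, not for the categorical feature as a whole.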
14
votes
4 answers
How to deal with array of string features in traditional machine learning?
Problem
Let's say we have a dataframe that looks like this:
age job friends label
23 'engineer' ['World of Warcraft', 'Netflix', '9gag'] 1
35 'manager' NULL …

tooskoolforkool
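One possible way to handle a list-of-strings column like `friends` above is multi-hot encoding with scikit-learn's `MultiLabelBinarizer`. This is a sketch under assumptions: the label column is left out because the excerpt is truncated, and treating the NULL entry as an empty list is a choice made here, not something the question states.

```python
import pandas as pd
from sklearn.preprocessing import MultiLabelBinarizer

# The two rows visible in the excerpt (label column omitted).
df = pd.DataFrame({
    "age": [23, 35],
    "job": ["engineer", "manager"],
    "friends": [["World of Warcraft", "Netflix", "9gag"], None],
})

# Assumption: a missing list is treated as an empty list.
lists = df["friends"].apply(lambda v: v if isinstance(v, list) else [])

# One binary column per distinct string that appears in any list.
mlb = MultiLabelBinarizer()
multi_hot = pd.DataFrame(
    mlb.fit_transform(lists),
    columns=[f"friends_{c}" for c in mlb.classes_],
    index=df.index,
)

features = pd.concat([df.drop(columns="friends"), multi_hot], axis=1)
print(features)
```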
13
votes
3 answers
Why shouldn't the sklearn LabelEncoder be used to encode input data?
The docs for sklearn.LabelEncoder start with
This transformer should be used to encode target values, i.e. y, and not the input X.
Why is this?
I post just one example of this recommendation being ignored in practice, although there seems to be…

hlud6646
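For context on the two encoders, here is a small sketch contrasting `LabelEncoder` (documented for the 1-D target `y`) with `OrdinalEncoder`, its counterpart for a 2-D feature matrix `X`. The toy arrays are hypothetical, and the sketch shows only the difference in interface, not a full answer to the question.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder, OrdinalEncoder

# Hypothetical toy data: two categorical feature columns and a string target.
X = np.array([["red", "S"], ["blue", "M"], ["red", "L"]])
y = np.array(["cat", "dog", "cat"])

# LabelEncoder works on a single 1-D array, which fits the target.
y_enc = LabelEncoder().fit_transform(y)

# OrdinalEncoder works on a 2-D array and learns one mapping per column.
X_enc = OrdinalEncoder().fit_transform(X)

print(y_enc)
print(X_enc)
```

Both encoders sort categories alphabetically by default, so "L", "M", "S" become 0, 1, 2 rather than following size order. `OrdinalEncoder` accepts a `categories` argument for cases where the order matters.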
9
votes
1 answer
LabelEncoder for categorical features?
This might be a beginner question, but I have seen a lot of people using LabelEncoder() to replace categorical variables with ordinal values. A lot of people use it by passing multiple columns at a time; however, I have some doubts about having…

ArsenOz
7
votes
0 answers
Slow features engineering in PySpark
I am trying to do data preparation in PySpark that involves, among other steps, string indexing, one-hot encoding, and quantile discretization. My data frame has quite a lot of columns (1 thousand columns including 500 interval columns, 250…

Pawel
6
votes
1 answer
How handle categorical features in the latest Random Forest in Spark?
In the MLlib version of Random Forest it was possible to specify the columns with nominal features (numerical but still categorical variables) with the parameter categoricalFeaturesInfo.
What about the ML Random Forest? In the user guide there…

Andrew_457
5
votes
2 answers
What is the best way to perform value estimation on a dataset with discrete, continuous, and categorical variables?
What is the best approach to this regression problem, in terms of performance as well as accuracy? Would feature importance be helpful in this scenario? And how do I process this large range of data?
Please note that I am not an expert on any of…

crypthusiast0
5
votes
1 answer
What is the best method of combining image feature and numeric feature together using CNN in Machine Learning?
I've got this question here: For example, if it is necessary to predict a disease using both image data and some numeric data, so that the features would be like:
feature 1: image of the disease.
in shape: (batch_size, width,height)
feature 2:…

klein li
5
votes
2 answers
how to make features using featuretools, for the new data(on which we want to make prediction)
I have a single dataframe and want to use featuretools for the automated feature engineering part. I am able to do it with the normalize entities function. The code snippet is below:
es = ft.EntitySet(id = 'obs_data')
es = es.entity_from_dataframe(entity_id = 'obs',…

Mohit Sharma
5
votes
2 answers
Is it a good idea to use word2vec for encoding of categorical features?
I am facing a binary prediction task and have a set of features, all of which are categorical. A key challenge is therefore to encode those categorical features as numbers, and I was looking for smart ways to do so.
I stumbled over word2vec, which is…

BigBrian
5
votes
3 answers
Spread an integer over several rows as many times as it is divided by a constant
I have a dataframe
Date repair
2018-07-01 4420
2018-07-02 NA
2018-07-03 NA
2018-07-04 NA
2018-07-05 NA
Where 4420 is time in minutes. I'm trying…

Dmytro Fedoriuk
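The desired output is cut off in the excerpt above, so the following is only one reading of the title: split the total into chunks of at most a fixed constant and write one chunk per row. The constant of 1440 (minutes in a day), the helper name `spread`, and the column name `repair_spread` are all assumptions made for illustration.

```python
import pandas as pd

# The frame shown in the excerpt, with NA represented as None.
df = pd.DataFrame({
    "Date": pd.date_range("2018-07-01", periods=5, freq="D"),
    "repair": [4420, None, None, None, None],
})

CAP = 1440  # assumption: the constant is the number of minutes in a day


def spread(total, cap, n_rows):
    """Split `total` into chunks of at most `cap`, padded with zeros to `n_rows`."""
    full, rest = divmod(int(total), cap)
    chunks = [cap] * full + ([rest] if rest else [])
    if len(chunks) > n_rows:
        raise ValueError("not enough rows to hold the spread value")
    return chunks + [0] * (n_rows - len(chunks))


# Hypothetical output column holding the per-day share of the repair time.
df["repair_spread"] = spread(df.loc[0, "repair"], CAP, len(df))
print(df)
```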
5
votes
0 answers
Difference between numeric_column shape=2 and two numeric columns
I initially have time-related data as integers in the format:
1234 # corresponds to 12:34
2359 # corresponds to 23:59
1) The first option is to describe time as numeric_column:
tf.feature_column.numeric_column(key="start_time", dtype=tf.int32)
2)…

O. Korniienko
5
votes
1 answer
Cyclic ordinal features in random forest
How do you prepare cyclic ordinal features like time in a day or day in a week for the random forest algorithm?
If time is simply encoded as minutes after midnight, the difference between 23:55 and 00:05 will be very large, although it is…

Max
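A common encoding for the wrap-around problem described above maps each time onto the unit circle with a sine and a cosine component. This is a sketch of that general technique; whether it is the best choice for a random forest specifically is not settled by the excerpt. The function name `encode_time_of_day` is hypothetical.

```python
import numpy as np

MINUTES_PER_DAY = 24 * 60


def encode_time_of_day(minutes):
    """Map minutes after midnight to (sin, cos) coordinates on the unit circle."""
    angle = 2 * np.pi * np.asarray(minutes, dtype=float) / MINUTES_PER_DAY
    return np.column_stack([np.sin(angle), np.cos(angle)])


# 23:55, 00:05, and 12:00 expressed as minutes after midnight.
points = encode_time_of_day([23 * 60 + 55, 5, 12 * 60])

# Euclidean distances in the encoded space.
near = np.linalg.norm(points[0] - points[1])  # 23:55 vs 00:05
far = np.linalg.norm(points[0] - points[2])   # 23:55 vs 12:00
print(near, far)
```

In the encoded space, 23:55 and 00:05 end up close together while 23:55 and 12:00 end up nearly opposite, which is the property the raw minute count lacks.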
5
votes
1 answer
2-dimensional binning with Pandas
So I have two sets of features that I wish to bin (classify) and then combine to create a new feature. It is not unlike classifying coordinates into grids on a map.
The issue is that the features are not evenly distributed and I would like to use…

Reuben L.
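One way to build the grid-like feature described above is to bin each feature separately and combine the two bin indices into a single cell id. The sketch assumes quantile-based bins via `pd.qcut`, which the truncated excerpt does not confirm. The synthetic data and the column names `x`, `y`, and `cell` are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical, unevenly distributed features.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.exponential(scale=1.0, size=200),
    "y": rng.normal(size=200),
})

n_bins = 4

# Quantile bins give roughly equal counts per bin even for skewed data.
df["x_bin"] = pd.qcut(df["x"], q=n_bins, labels=False)
df["y_bin"] = pd.qcut(df["y"], q=n_bins, labels=False)

# Combine the two bin indices into one grid-cell id in the range 0..n_bins**2 - 1.
df["cell"] = df["x_bin"] * n_bins + df["y_bin"]
print(df["cell"].value_counts().sort_index())
```

Swapping `pd.qcut` for `pd.cut` would give equal-width bins instead, at the cost of sparsely populated cells when the data is skewed.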
5
votes
2 answers
Text features input format for classification algorithms in scikit-learn
I'm starting to use scikit-learn to do some NLP. I've already used some classifiers from NLTK and now I want to try the ones implemented in scikit-learn.
My data is basically sentences, and I extract features from some words of those sentences…

feralvam