Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it becomes usable at all, up to transforming data so that it can be handled by algorithms or improves their results. Where possible, also use the tags for the specific methods involved. This tag should be used for meaningful preprocessing steps in a data pipeline, whether they run ahead of an algorithm or stand on their own.

Data preprocessing applies to several stages in which data can exist. At a higher level it takes place right before more meaningful processing steps, such as analysis.
But preprocessing also starts when raw data is generated and must be brought into a meaningful, usable format. Currently the tag fits this lower-level description better, and likewise questions where the structure of how the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, is also a major part of it; for those questions, prefer the corresponding more specific tags.

This tag should focus more on the rearrangement and transformation of data so that it can be used by algorithms or improves their results. Examples of preprocessing are encoding data, and scaling or normalizing an already formatted dataset.

Preprocessing algorithms and techniques can be found in the scikit-learn modules for preprocessing and normalization (sklearn.preprocessing).

Further theory and examples on the necessity of data preprocessing are discussed in the scikit-learn user guide section "Preprocessing data".
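
As a rough illustration of the encoding and scaling steps mentioned above, here is a minimal scikit-learn sketch (the toy data is invented for the example):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Toy data: two numeric features and one categorical feature.
    X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    X_cat = np.array([["red"], ["green"], ["red"]])

    # Scale numeric features to zero mean and unit variance.
    X_num_scaled = StandardScaler().fit_transform(X_num)

    # One-hot encode the categorical feature into indicator columns.
    X_cat_encoded = OneHotEncoder().fit_transform(X_cat).toarray()

    # Concatenate into a single preprocessed feature matrix.
    X_preprocessed = np.hstack([X_num_scaled, X_cat_encoded])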

488 questions
24 votes, 4 answers

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using…
user11086563
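
One common way to keep the names through scaling (a sketch assuming the data starts as a pandas DataFrame; the column names and toy data below are made up, not taken from the question) is to wrap the scaled array back into a DataFrame before training, since XGBoost picks feature names up from DataFrame columns:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from xgboost import XGBRegressor, plot_importance

    # Hypothetical DataFrame with named feature columns.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["age", "income", "tenure"])
    y = df["age"] * 2 + rng.normal(size=200)

    # Scale, then restore the column names so they survive preprocessing.
    X = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)

    model = XGBRegressor(n_estimators=50).fit(X, y)
    plot_importance(model)  # the plot shows "age" / "income" / "tenure" instead of f0 / f1 / f2
    plt.show()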
6 votes, 0 answers

Variable importance by "mlr3filters" does not work in "mlr3proba" after preprocessing data with "mlr3pipelines"

Running the code below, which uses the mlr3proba, mlr3pipelines and mlr3filters packages of R to apply the rpart algorithm to a preprocessed dataset and compute "variable importance", shows an error: task <- tsk("iris") learner <-…
user15779336
5 votes, 1 answer

What does tensorflow's flat_map + window.batch() do to a dataset/array?

I'm following one of the online courses about time series predictions using Tensorflow. The function used to convert the NumPy array (the time series) into a TensorFlow dataset for the LSTM-based model is already given (with my comment lines): def…
Roberto
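
For context, the windowing pattern typically shown in such courses (a sketch, not the asker's exact code) turns the series into overlapping windows and then flattens each window back into one batched tensor:

    import numpy as np
    import tensorflow as tf

    series = np.arange(10, dtype=np.float32)
    window_size = 4

    ds = tf.data.Dataset.from_tensor_slices(series)
    # window() yields a dataset of sub-datasets of length window_size + 1 ...
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    # ... and flat_map(lambda w: w.batch(...)) turns each sub-dataset into one tensor.
    ds = ds.flat_map(lambda window: window.batch(window_size + 1))
    # Split each window into (inputs, label): the last value is the target.
    ds = ds.map(lambda window: (window[:-1], window[-1]))

    for x, y in ds.take(2):
        print(x.numpy(), "->", y.numpy())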
5 votes, 1 answer

ValueError on inverse transform using OrdinalEncoder with dictionary

I can transform the target column to the desired ordered numerical values using categorical encoding and ordinal encoding. But I am unable to perform inverse_transform, as it raises the error written below. import pandas as pd import…
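
Without the full traceback it is hard to be definitive, but OrdinalEncoder.inverse_transform expects a 2-D array with the same number of columns the encoder was fitted on; a minimal working round trip (the column name and categories are made up) looks like this:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"priority": ["low", "high", "medium", "low"]})

    # Explicit category order so the encoding is truly ordinal.
    enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
    encoded = enc.fit_transform(df[["priority"]])  # shape (n_samples, 1)

    # inverse_transform needs the same 2-D shape, e.g. encoded or values.reshape(-1, 1).
    restored = enc.inverse_transform(encoded)
    print(restored.ravel())  # ['low' 'high' 'medium' 'low']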
5 votes, 1 answer

Error while downloading the dataframe from Streamlit web application after data preprocessing

The required task is to deploy a data preprocessing web application on Streamlit in which the user can upload a raw dataframe and download the processed dataframe. I am trying to download the file on which data preprocessing like missing value…
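
A minimal sketch of the usual pattern (assuming the processed result is a pandas DataFrame; the widget labels and the fill-with-mean step are placeholders) is to serialize the DataFrame to CSV bytes and hand them to st.download_button:

    import pandas as pd
    import streamlit as st

    uploaded = st.file_uploader("Upload raw CSV", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        # Hypothetical preprocessing step: fill missing values with column means.
        processed = df.fillna(df.mean(numeric_only=True))

        st.download_button(
            label="Download processed CSV",
            data=processed.to_csv(index=False).encode("utf-8"),
            file_name="processed.csv",
            mime="text/csv",
        )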
5 votes, 1 answer

Save Python output as PDF?

I am using a Python notebook for EDA and data science. For that I often work with the dataprep library. I want to save the report that has been created using that library in PDF format.
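
One hedged approach (assuming dataprep's create_report, whose report object can be saved as HTML, plus an HTML-to-PDF converter such as pdfkit backed by wkhtmltopdf; the file names are placeholders) is to save the report as HTML first and convert it afterwards:

    import pandas as pd
    import pdfkit  # requires the wkhtmltopdf binary to be installed on the system
    from dataprep.eda import create_report

    df = pd.read_csv("data.csv")  # hypothetical input file

    report = create_report(df)
    report.save("eda_report")  # writes an HTML file; exact save() behaviour may vary by dataprep version
    pdfkit.from_file("eda_report.html", "eda_report.pdf")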
5 votes, 1 answer

How can I replace emojis with text and treat them as single words?

I have to do topic modeling based on pieces of text containing emojis with R. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results. A red heart emoji is translated as "red heart…
TR_IBK21
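
The question uses R's textclean, but the underlying idea carries over; as a language-neutral illustration, a Python sketch with the emoji package (an assumption, not the asker's setup) replaces each emoji with its underscore-joined name so it tokenizes as a single word:

    import emoji

    text = "I love this ❤️ so much 😀"

    # demojize() replaces each emoji with its textual name; spaces as delimiters plus
    # the underscores already in the names keep each emoji as one token, e.g. "red_heart".
    converted = emoji.demojize(text, delimiters=(" ", " "))
    print(converted.split())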
4 votes, 0 answers

Stratified train/val/test split in Pytorch

I have an image classification dataset with 6 categories that I'm loading using the torchvision ImageFolder class. I have written the code below to split the dataset into 3 sets in a stratified manner: from torch.utils.data import Subset from…
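
A common sketch for this (using scikit-learn's train_test_split for the stratification, which is one of several options; the image path is a placeholder) splits the index list by label and wraps the pieces in Subset:

    from sklearn.model_selection import train_test_split
    from torch.utils.data import Subset
    from torchvision import datasets, transforms

    # Hypothetical image folder with one sub-directory per class.
    dataset = datasets.ImageFolder("path/to/images", transform=transforms.ToTensor())
    labels = dataset.targets  # ImageFolder stores one class index per sample

    # 60/20/20 split, stratified on the class labels at both steps.
    train_val_idx, test_idx = train_test_split(
        list(range(len(dataset))), test_size=0.2, stratify=labels, random_state=42
    )
    train_idx, val_idx = train_test_split(
        train_val_idx,
        test_size=0.25,  # 0.25 of the remaining 80% -> 20% overall
        stratify=[labels[i] for i in train_val_idx],
        random_state=42,
    )

    train_set = Subset(dataset, train_idx)
    val_set = Subset(dataset, val_idx)
    test_set = Subset(dataset, test_idx)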
4 votes, 1 answer

Is it a bad idea to always standardize all features by default?

Is there a reason not to standardize all features by default? I realize it may not be necessary for, e.g., decision trees, but it is for certain algorithms such as KNN, SVM and K-Means. Would there be any harm in just routinely doing this for all of my…
Levon
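
Whatever the answer, a common way to make standardization routine without leaking statistics from the validation data (a sketch, not a claim about what the asker should do) is to put the scaler inside a Pipeline so it is refit on each training fold:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # The scaler is fit only on the training portion of each fold, so scaling
    # statistics never leak from the held-out data into the model.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier())
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean())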
4 votes, 1 answer

The calculated RobustScaler in sklearn seems not right

I tried the RobustScaler in sklearn and found the results are not the same as the formula. The formula of the RobustScaler in sklearn is (x - median) / IQR. I have a matrix shown below: I test the first value in feature one (row one and column one). The scaled…
ZH. Yang
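
For reference, RobustScaler's default transform subtracts the per-feature median and divides by the interquartile range (25th to 75th percentile); a small sketch comparing it against a manual NumPy computation on an invented matrix:

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 30.0], [8.0, 40.0]])

    scaled = RobustScaler().fit_transform(X)

    # Manual version of the same formula, column by column.
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    manual = (X - median) / (q3 - q1)

    print(np.allclose(scaled, manual))  # True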
4 votes, 3 answers

How to access an Excel file on Onedrive with Openpyxl

I want to open an Excel file (on OneDrive) with openpyxl (Python). I received an error trying this: from openpyxl import load_workbook file = r"https://d.docs.live.net/dd10xxxxxxxxxx" wb = load_workbook(filename = file) self.fp = io.open(file,…
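
load_workbook cannot open an HTTP(S) URL directly; a hedged workaround (assuming the OneDrive link is a direct download link that does not require interactive sign-in) is to download the bytes first and pass a file-like object, which load_workbook accepts:

    import io

    import requests
    from openpyxl import load_workbook

    url = "https://d.docs.live.net/dd10xxxxxxxxxx"  # truncated URL from the question

    resp = requests.get(url)
    resp.raise_for_status()

    # Wrap the downloaded bytes in a file-like object for openpyxl.
    wb = load_workbook(filename=io.BytesIO(resp.content))
    print(wb.sheetnames)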
3 votes, 0 answers

How to create consistent time series dataset using an inconsistent time series dataset?

I am new to data science and machine learning, and I am working on a project involving time series data from wearable devices (using Python programming environment). I have the sampling frequency of each sensor modality for each device. Some of the…
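
A typical first step (a pandas sketch with invented column names, timestamps and an arbitrarily chosen target frequency) is to index each modality by its timestamps, resample everything to one common frequency, and interpolate the gaps:

    import pandas as pd

    # Hypothetical sensor readings with irregular timestamps.
    df = pd.DataFrame(
        {
            "timestamp": pd.to_datetime(
                ["2023-01-01 00:00:00.00", "2023-01-01 00:00:00.03", "2023-01-01 00:00:00.11"]
            ),
            "heart_rate": [71.0, 72.0, 75.0],
        }
    )

    regular = (
        df.set_index("timestamp")
        .resample("10ms")            # common target frequency, chosen for illustration
        .mean()
        .interpolate(method="time")  # fill the gaps created by resampling
    )
    print(regular)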
3 votes, 0 answers

Getting 'UnidentifiedImageError: cannot identify image file error' while converting pdf to image on google colab

I am using pdf2image to convert a PDF file into images. I am using the method convert_from_path. However, I keep getting the above-mentioned error on Google Colab. Surprisingly this does not happen when I execute the same code in a Jupyter notebook on…
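
For reference, baseline pdf2image usage on Colab needs the poppler system package installed; a minimal sketch (the file names are placeholders, and this does not diagnose the specific PDF in the question):

    # In a Colab cell, install the poppler backend first:
    #   !apt-get install -y poppler-utils
    from pdf2image import convert_from_path

    pages = convert_from_path("document.pdf", dpi=200)  # hypothetical file name
    for i, page in enumerate(pages):
        page.save(f"page_{i}.png", "PNG")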
3 votes, 1 answer

How to add a text preprocessing tokenization step into a TensorFlow model

I have a TensorFlow SavedModel which includes saved_model.pb and a variables folder. The preprocessing step has not been incorporated into this model, which is why I need to do preprocessing (tokenization etc.) before feeding the data to the model…
sariii
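
One way this is commonly handled (a sketch that assumes the SavedModel is Keras-loadable and uses a TextVectorization layer as a stand-in for whatever tokenizer the model was trained with; the paths and vocabulary are placeholders) is to wrap the loaded model behind a string input that runs the preprocessing first:

    import tensorflow as tf

    # Hypothetical vectorizer; in practice it must reproduce the training-time tokenization.
    vectorizer = tf.keras.layers.TextVectorization(max_tokens=10_000, output_sequence_length=64)
    vectorizer.adapt(tf.data.Dataset.from_tensor_slices(["some example text", "more text"]))

    loaded = tf.keras.models.load_model("path/to/saved_model")  # placeholder path

    # New end-to-end model: raw strings in, predictions out.
    inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
    x = vectorizer(inputs)
    outputs = loaded(x)
    end_to_end = tf.keras.Model(inputs, outputs)

    end_to_end.save("path/to/end_to_end_model")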
3 votes, 1 answer

How to adapt TextVectorization layer on tf.Dataset

I load my dataset like this: self.train_ds = tf.data.experimental.make_csv_dataset( self.config["input_paths"]["data"]["train"], batch_size=self.params["batch_size"], shuffle=False, label_name="tags", …
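
make_csv_dataset yields (features_dict, label) batches, so adapt() needs a dataset of just the raw-text column; a sketch (the file name and the column name "text" are assumptions):

    import tensorflow as tf

    train_ds = tf.data.experimental.make_csv_dataset(
        "train.csv",
        batch_size=32,
        shuffle=False,
        label_name="tags",
        num_epochs=1,  # the default repeats forever, which would make adapt() never finish
    )

    vectorizer = tf.keras.layers.TextVectorization(max_tokens=20_000)

    # Strip the labels and keep only the (assumed) raw-text column before adapting.
    text_only = train_ds.map(lambda features, label: features["text"])
    vectorizer.adapt(text_only)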