Questions tagged [data-preprocessing]

Preprocessing ranges from structuring and cleaning raw data so that it becomes usable at all, up to transforming data so that it can be handled by algorithms or improves their results. Where possible, also use the tags for the specific methods involved. This tag should be used for meaningful preprocessing steps in a data pipeline, whether they run ahead of an algorithm or stand on their own.

Data preprocessing applies to several stages in which data can exist. At a higher level it takes place right before more meaningful processing steps, such as analysis.
But preprocessing also starts when raw data is generated and must be brought into a meaningful, usable format. Currently the tag fits this lower-level description better, and likewise questions where the structure of how the data is stored and queried is important. Finding errors and missing values, and deciding how to handle them, is also a major part of it; for those questions, prefer the corresponding more specific tags.

This tag should focus more on the rearrangement and transformation of data so that it can be used by algorithms or improves their results. Examples of preprocessing are encoding data, and scaling or normalizing an already formatted dataset.

Preprocessing algorithms and techniques can be found in the scikit-learn modules for preprocessing and normalization (sklearn.preprocessing).

Further theory and examples on the necessity of data preprocessing are discussed in the scikit-learn user guide section "Preprocessing data".
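
As a rough illustration of the encoding and scaling steps mentioned above, here is a minimal scikit-learn sketch (the toy data is invented for the example):

    import numpy as np
    from sklearn.preprocessing import OneHotEncoder, StandardScaler

    # Toy data: two numeric features and one categorical feature.
    X_num = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0]])
    X_cat = np.array([["red"], ["green"], ["red"]])

    # Scale numeric features to zero mean and unit variance.
    X_num_scaled = StandardScaler().fit_transform(X_num)

    # One-hot encode the categorical feature into indicator columns.
    X_cat_encoded = OneHotEncoder().fit_transform(X_cat).toarray()

    # Concatenate into a single preprocessed feature matrix.
    X_preprocessed = np.hstack([X_num_scaled, X_cat_encoded])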

488 questions
24 votes, 4 answers

How to restore the original feature names in XGBoost feature importance plot (after preprocessing removed them)?

Preprocessing the training data (such as centering or scaling) before training an XGBoost model can lead to a loss of feature names. Most answers on SO suggest training the model in such a way that feature names aren't lost (such as using…
user11086563
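
One common way to keep the names through scaling (a sketch assuming the data starts as a pandas DataFrame; the column names and toy data below are made up, not taken from the question) is to wrap the scaled array back into a DataFrame before training, since XGBoost picks feature names up from DataFrame columns:

    import numpy as np
    import pandas as pd
    import matplotlib.pyplot as plt
    from sklearn.preprocessing import StandardScaler
    from xgboost import XGBRegressor, plot_importance

    # Hypothetical DataFrame with named feature columns.
    rng = np.random.default_rng(0)
    df = pd.DataFrame(rng.normal(size=(200, 3)), columns=["age", "income", "tenure"])
    y = df["age"] * 2 + rng.normal(size=200)

    # Scale, then restore the column names so they survive preprocessing.
    X = pd.DataFrame(StandardScaler().fit_transform(df), columns=df.columns, index=df.index)

    model = XGBRegressor(n_estimators=50).fit(X, y)
    plot_importance(model)  # the plot shows "age" / "income" / "tenure" instead of f0 / f1 / f2
    plt.show()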
6 votes, 0 answers

Variable importance by "mlr3filters" does not work in "mlr3proba" after preprocessing data with "mlr3pipelines"

Running the code below, which uses the mlr3proba, mlr3pipelines and mlr3filters packages of R to apply the rpart algorithm to a preprocessed dataset and compute "variable importance", shows an error: task <- tsk("iris") learner <-…
user15779336
5 votes, 1 answer

What does tensorflow's flat_map + window.batch() do to a dataset/array?

I'm following one of the online courses about time series predictions using Tensorflow. The function used to convert the NumPy array (the time series) into a TensorFlow dataset for the LSTM-based model is already given (with my comment lines): def…
Roberto
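
For context, the windowing pattern typically shown in such courses (a sketch, not the asker's exact code) turns the series into overlapping windows and then flattens each window back into one batched tensor:

    import numpy as np
    import tensorflow as tf

    series = np.arange(10, dtype=np.float32)
    window_size = 4

    ds = tf.data.Dataset.from_tensor_slices(series)
    # window() yields a dataset of sub-datasets of length window_size + 1 ...
    ds = ds.window(window_size + 1, shift=1, drop_remainder=True)
    # ... and flat_map(lambda w: w.batch(...)) turns each sub-dataset into one tensor.
    ds = ds.flat_map(lambda window: window.batch(window_size + 1))
    # Split each window into (inputs, label): the last value is the target.
    ds = ds.map(lambda window: (window[:-1], window[-1]))

    for x, y in ds.take(2):
        print(x.numpy(), "->", y.numpy())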
5 votes, 1 answer

ValueError on inverse transform using OrdinalEncoder with dictionary

I can transform the target column to the desired ordered numerical values using categorical encoding and ordinal encoding. But I am unable to perform inverse_transform, as it raises the error written below. import pandas as pd import…
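
Without the full traceback it is hard to be definitive, but OrdinalEncoder.inverse_transform expects a 2-D array with the same number of columns the encoder was fitted on; a minimal working round trip (the column name and categories are made up) looks like this:

    import pandas as pd
    from sklearn.preprocessing import OrdinalEncoder

    df = pd.DataFrame({"priority": ["low", "high", "medium", "low"]})

    # Explicit category order so the encoding is truly ordinal.
    enc = OrdinalEncoder(categories=[["low", "medium", "high"]])
    encoded = enc.fit_transform(df[["priority"]])  # shape (n_samples, 1)

    # inverse_transform needs the same 2-D shape, e.g. encoded or values.reshape(-1, 1).
    restored = enc.inverse_transform(encoded)
    print(restored.ravel())  # ['low' 'high' 'medium' 'low']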
5 votes, 1 answer

Error while downloading the dataframe from Streamlit web application after data preprocessing

The required task is to deploy a data preprocessing web application on Streamlit in which the user can upload a raw dataframe and download the processed dataframe. I am trying to download the file on which data preprocessing like missing value…
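
A minimal sketch of the usual pattern (assuming the processed result is a pandas DataFrame; the widget labels and the fill-with-mean step are placeholders) is to serialize the DataFrame to CSV bytes and hand them to st.download_button:

    import pandas as pd
    import streamlit as st

    uploaded = st.file_uploader("Upload raw CSV", type="csv")
    if uploaded is not None:
        df = pd.read_csv(uploaded)
        # Hypothetical preprocessing step: fill missing values with column means.
        processed = df.fillna(df.mean(numeric_only=True))

        st.download_button(
            label="Download processed CSV",
            data=processed.to_csv(index=False).encode("utf-8"),
            file_name="processed.csv",
            mime="text/csv",
        )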
5 votes, 1 answer

Save Python output as PDF?

I am using a Python notebook for EDA and data science. For that I often work with the dataprep library. I want to save the report that has been created using that library in PDF format.
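
One hedged approach (assuming dataprep's create_report, whose report object can be saved as HTML, plus an HTML-to-PDF converter such as pdfkit backed by wkhtmltopdf; the file names are placeholders) is to save the report as HTML first and convert it afterwards:

    import pandas as pd
    import pdfkit  # requires the wkhtmltopdf binary to be installed on the system
    from dataprep.eda import create_report

    df = pd.read_csv("data.csv")  # hypothetical input file

    report = create_report(df)
    report.save("eda_report")  # writes an HTML file; exact save() behaviour may vary by dataprep version
    pdfkit.from_file("eda_report.html", "eda_report.pdf")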
5 votes, 1 answer

How can I replace emojis with text and treat them as single words?

I have to do topic modeling based on pieces of text containing emojis with R. Using the replace_emoji() and replace_emoticon() functions lets me analyze them, but there is a problem with the results. A red heart emoji is translated as "red heart…
TR_IBK21
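
The question uses R's textclean, but the underlying idea carries over; as a language-neutral illustration, a Python sketch with the emoji package (an assumption, not the asker's setup) replaces each emoji with its underscore-joined name so it tokenizes as a single word:

    import emoji

    text = "I love this ❤️ so much 😀"

    # demojize() replaces each emoji with its textual name; spaces as delimiters plus
    # the underscores already in the names keep each emoji as one token, e.g. "red_heart".
    converted = emoji.demojize(text, delimiters=(" ", " "))
    print(converted.split())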
4 votes, 0 answers

Stratified train/val/test split in Pytorch

I have an image classification dataset with 6 categories that I'm loading using the torchvision ImageFolder class. I have written the code below to split the dataset into 3 sets in a stratified manner: from torch.utils.data import Subset from…
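
A common sketch for this (using scikit-learn's train_test_split for the stratification, which is one of several options; the image path is a placeholder) splits the index list by label and wraps the pieces in Subset:

    from sklearn.model_selection import train_test_split
    from torch.utils.data import Subset
    from torchvision import datasets, transforms

    # Hypothetical image folder with one sub-directory per class.
    dataset = datasets.ImageFolder("path/to/images", transform=transforms.ToTensor())
    labels = dataset.targets  # ImageFolder stores one class index per sample

    # 60/20/20 split, stratified on the class labels at both steps.
    train_val_idx, test_idx = train_test_split(
        list(range(len(dataset))), test_size=0.2, stratify=labels, random_state=42
    )
    train_idx, val_idx = train_test_split(
        train_val_idx,
        test_size=0.25,  # 0.25 of the remaining 80% -> 20% overall
        stratify=[labels[i] for i in train_val_idx],
        random_state=42,
    )

    train_set = Subset(dataset, train_idx)
    val_set = Subset(dataset, val_idx)
    test_set = Subset(dataset, test_idx)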
4 votes, 1 answer

Is it a bad idea to always standardize all features by default?

Is there a reason not to standardize all features by default? I realize it may not be necessary for, e.g., decision trees, but it is for certain algorithms such as KNN, SVM and K-Means. Would there be any harm in just routinely doing this for all of my…
Levon
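
Whatever the answer, a common way to make standardization routine without leaking statistics from the validation data (a sketch, not a claim about what the asker should do) is to put the scaler inside a Pipeline so it is refit on each training fold:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.pipeline import make_pipeline
    from sklearn.preprocessing import StandardScaler

    X, y = load_iris(return_X_y=True)

    # The scaler is fit only on the training portion of each fold, so scaling
    # statistics never leak from the held-out data into the model.
    model = make_pipeline(StandardScaler(), KNeighborsClassifier())
    scores = cross_val_score(model, X, y, cv=5)
    print(scores.mean())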
4 votes, 1 answer

The calculated RobustScaler in sklearn seems not right

I tried the RobustScaler in sklearn and found the results are not the same as the formula. The formula of the RobustScaler in sklearn is (x - median) / IQR. I have a matrix shown below: I test the first value in feature one (row one and column one). The scaled…
ZH. Yang
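
For reference, RobustScaler's default transform subtracts the per-feature median and divides by the interquartile range (25th to 75th percentile); a small sketch comparing it against a manual NumPy computation on an invented matrix:

    import numpy as np
    from sklearn.preprocessing import RobustScaler

    X = np.array([[1.0, 10.0], [2.0, 20.0], [4.0, 30.0], [8.0, 40.0]])

    scaled = RobustScaler().fit_transform(X)

    # Manual version of the same formula, column by column.
    median = np.median(X, axis=0)
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    manual = (X - median) / (q3 - q1)

    print(np.allclose(scaled, manual))  # True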
4 votes, 3 answers

How to access an Excel file on Onedrive with Openpyxl

I want to open an Excel file (on OneDrive) with openpyxl (Python). I received an error trying this: from openpyxl import load_workbook file = r"https://d.docs.live.net/dd10xxxxxxxxxx" wb = load_workbook(filename = file) self.fp = io.open(file,…
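
load_workbook cannot open an HTTP(S) URL directly; a hedged workaround (assuming the OneDrive link is a direct download link that does not require interactive sign-in) is to download the bytes first and pass a file-like object, which load_workbook accepts:

    import io

    import requests
    from openpyxl import load_workbook

    url = "https://d.docs.live.net/dd10xxxxxxxxxx"  # truncated URL from the question

    resp = requests.get(url)
    resp.raise_for_status()

    # Wrap the downloaded bytes in a file-like object for openpyxl.
    wb = load_workbook(filename=io.BytesIO(resp.content))
    print(wb.sheetnames)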
3 votes, 0 answers

How to create consistent time series dataset using an inconsistent time series dataset?

I am new to data science and machine learning, and I am working on a project involving time series data from wearable devices (using Python programming environment). I have the sampling frequency of each sensor modality for each device. Some of the…
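
A typical first step (a pandas sketch with invented column names, timestamps and an arbitrarily chosen target frequency) is to index each modality by its timestamps, resample everything to one common frequency, and interpolate the gaps:

    import pandas as pd

    # Hypothetical sensor readings with irregular timestamps.
    df = pd.DataFrame(
        {
            "timestamp": pd.to_datetime(
                ["2023-01-01 00:00:00.00", "2023-01-01 00:00:00.03", "2023-01-01 00:00:00.11"]
            ),
            "heart_rate": [71.0, 72.0, 75.0],
        }
    )

    regular = (
        df.set_index("timestamp")
        .resample("10ms")            # common target frequency, chosen for illustration
        .mean()
        .interpolate(method="time")  # fill the gaps created by resampling
    )
    print(regular)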
3 votes, 0 answers

Getting 'UnidentifiedImageError: cannot identify image file error' while converting pdf to image on google colab

I am using pdf2image to convert a PDF file into images. I am using the method convert_from_path. However, I keep getting the above-mentioned error on Google Colab. Surprisingly this does not happen when I execute the same code in a Jupyter notebook on…
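
For reference, baseline pdf2image usage on Colab needs the poppler system package installed; a minimal sketch (the file names are placeholders, and this does not diagnose the specific PDF in the question):

    # In a Colab cell, install the poppler backend first:
    #   !apt-get install -y poppler-utils
    from pdf2image import convert_from_path

    pages = convert_from_path("document.pdf", dpi=200)  # hypothetical file name
    for i, page in enumerate(pages):
        page.save(f"page_{i}.png", "PNG")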
3 votes, 1 answer

How to add a text preprocessing tokenization step into a TensorFlow model

I have a TensorFlow SavedModel which includes saved_model.pb and a variables folder. The preprocessing step has not been incorporated into this model, which is why I need to do preprocessing (tokenization etc.) before feeding the data to the model…
sariii
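
One way this is commonly handled (a sketch that assumes the SavedModel is Keras-loadable and uses a TextVectorization layer as a stand-in for whatever tokenizer the model was trained with; the paths and vocabulary are placeholders) is to wrap the loaded model behind a string input that runs the preprocessing first:

    import tensorflow as tf

    # Hypothetical vectorizer; in practice it must reproduce the training-time tokenization.
    vectorizer = tf.keras.layers.TextVectorization(max_tokens=10_000, output_sequence_length=64)
    vectorizer.adapt(tf.data.Dataset.from_tensor_slices(["some example text", "more text"]))

    loaded = tf.keras.models.load_model("path/to/saved_model")  # placeholder path

    # New end-to-end model: raw strings in, predictions out.
    inputs = tf.keras.Input(shape=(1,), dtype=tf.string)
    x = vectorizer(inputs)
    outputs = loaded(x)
    end_to_end = tf.keras.Model(inputs, outputs)

    end_to_end.save("path/to/end_to_end_model")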
3 votes, 1 answer

How to adapt TextVectorization layer on tf.Dataset

I load my dataset like this: self.train_ds = tf.data.experimental.make_csv_dataset( self.config["input_paths"]["data"]["train"], batch_size=self.params["batch_size"], shuffle=False, label_name="tags", …
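
make_csv_dataset yields (features_dict, label) batches, so adapt() needs a dataset of just the raw-text column; a sketch (the file name and the column name "text" are assumptions):

    import tensorflow as tf

    train_ds = tf.data.experimental.make_csv_dataset(
        "train.csv",
        batch_size=32,
        shuffle=False,
        label_name="tags",
        num_epochs=1,  # the default repeats forever, which would make adapt() never finish
    )

    vectorizer = tf.keras.layers.TextVectorization(max_tokens=20_000)

    # Strip the labels and keep only the (assumed) raw-text column before adapting.
    text_only = train_ds.map(lambda features, label: features["text"])
    vectorizer.adapt(text_only)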