Manipulate values of a subset of dataframe columns

Question

I want to standardize a number of columns in a dataframe, but not all columns. The columns to be manipulated are specified in a vector.

To illustrate, take the following simulated dataframe:

set.seed(1)
mydf <- data.frame(matrix(sample(100, 36, replace = TRUE), nrow = 12))

The two columns to be manipulated are defined below (note that the solution should apply to a subset of columns identified by name, not by their position in the dataframe):

variables <- c("X1", "X2")

I wrote the following loop to standardize the two columns, but it throws an error:

for (i in seq_along(variables)) {
  mydf[variables[i]] <- ((mydf[variables[i]] - mean(mydf[variables[i]], na.rm = TRUE)) / sd(mydf[variables[i]], na.rm = TRUE))
}

What is the correct way to do this? (I am a beginner to R.)

score 2 · Accepted Answer · answered Oct 11 '18 at 08:52

You can use scale, and you do not need a loop:

mydf[variables] <- scale(mydf[variables])
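As a quick check, here is a minimal sketch that reuses the question's mydf and variables and confirms that the selected columns end up with mean 0 and standard deviation 1, while X3 is left untouched. The name original is only for illustration and is not part of the question.

```r
set.seed(1)
mydf <- data.frame(matrix(sample(100, 36, replace = TRUE), nrow = 12))
variables <- c("X1", "X2")

original <- mydf  # copy kept only for comparison (illustrative name)

# scale() centers each column on its mean and divides by its sd;
# it returns a matrix, which is assigned back column by column
mydf[variables] <- scale(mydf[variables])

# the selected columns now have mean (numerically) 0 and sd 1
colMeans(mydf[variables])
sapply(mydf[variables], sd)

# the column that was not listed is unchanged
identical(mydf$X3, original$X3)
```

scale omits missing values when it computes the column means and standard deviations, so this should behave like the na.rm = TRUE calls in the question's loop.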

Sven Hohenstein

Hunaidkhan · Answer 2 · 2018-10-11T08:56:32.050

The normalizeFeatures function from the mlr package, called with method = "standardize", will do this.

set.seed(1)
mydf <- data.frame(matrix(sample(100, 36, replace = TRUE), nrow = 12))

colnames(mydf)
library(mlr)
# normalizeFeatures() accepts a data frame and returns a standardized copy
trainTask <- normalizeFeatures(mydf[c("X1", "X2")], method = "standardize")

edited Oct 11 '18 at 08:56

answered Oct 11 '18 at 08:35

Very elegant; however, the solution needs to apply to a subset of columns identified by name (not by their position in the dataframe). This was unclear from my question, and I have edited it accordingly. – broti Oct 11 '18 at 08:51
Updated the solution to use specific column names. – Hunaidkhan Oct 11 '18 at 08:56
Following up on my previous comment: replacing `mydf[c(1,2)]` with `mydf[variables]` in your last line of code does what I want. – broti Oct 11 '18 at 09:01
You just have to change it to variables; that's not a big deal. – Hunaidkhan Oct 11 '18 at 13:39

markus · Answer 3 · 2018-10-11T08:55:47.553

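To get your loop working, use [[ instead of [ inside mean and sd. Both functions expect a vector, but mydf[variables[i]] returns a one-column data frame, whereas mydf[[variables[i]]] returns the column itself as a vector.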

for (i in seq_along(variables)) {
  mydf[variables[i]] <-
    ((mydf[variables[i]] - mean(mydf[[variables[i]]], na.rm = TRUE)) / sd(mydf[[variables[i]]], na.rm = TRUE))
}

But consider using scale instead; see @SvenHohenstein's answer.
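The difference between the two subsetting operators can be seen directly. This is a minimal sketch that reuses the question's simulated data:

```r
set.seed(1)
mydf <- data.frame(matrix(sample(100, 36, replace = TRUE), nrow = 12))

class(mydf["X1"])    # "data.frame": single brackets keep the data frame wrapper
class(mydf[["X1"]])  # "integer": double brackets extract the column as a vector

# mean() and sd() work as expected on the extracted vector
mean(mydf[["X1"]], na.rm = TRUE)
sd(mydf[["X1"]], na.rm = TRUE)
```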

edited Oct 11 '18 at 08:55

answered Oct 11 '18 at 08:36

Thanks for the clarification. I agree that using `scale` is most straightforward. – broti Oct 11 '18 at 09:04
