Hey awesome community,

I am trying to learn how to use loops to iterate over the columns of a data set. I'm using the sns data set provided free for machine learning and trying to run a k-means cluster analysis. The first thing I need to do is center and scale the variables. I want to do this using a loop, and I need to select all but the first four variables in the data set. Here's what I tried, and I'm not sure why it doesn't work:

for(i in names(sns.nona[, -c(1:4)])){
    scale(i, center = TRUE, scale = TRUE)
}

Error in colMeans(x, na.rm = TRUE) : 'x' must be numeric

I get the above error, which must mean it's not selecting the actual column of the data set, just the name. I guess I should expect that, but how do I make it reference the data?

edit: I also tried:

for(i in names(sns.nona)[-c(1:4)]){
    scale(sns.nona[,i], center = TRUE, scale = TRUE)
}

This did not return an error, but it does not appear to be centering the data. I should get some negative values wherever the original value was 0, as I'd be subtracting the column mean from it...

Michael
  • is there some reason you need to use a loop? – Dij May 12 '19 at 02:10
  • Only for my own edification. I tried to do this using tidyr as well, but when I added one_of(-c(v1, v2, v3, v4)) it said it couldn't find v1 for whatever reason. I actually wouldn't mind seeing it both ways. This is just about learning for me. :) – Michael May 12 '19 at 02:17
  • Your last code should have worked. Note that centering would not always produce a negative value: if the mean was zero, a 0 stays 0, and if the mean was negative, it could produce a positive value! Anyway, as far as using tidyr, the reason you got that error is that tidyr doesn't require you to quote the variable names. So if your variable name is `Blah`, you can do `data %>% select(Blah) %>% transmute(Blah = scale(Blah))` – Dij May 12 '19 at 02:27

3 Answers

One way to do this without writing a loop:

scale(data[-1:-4])

Also, if you want to do this while enabling yourself to modify the selected columns without creating a new data frame:

data[-1:-4] <- lapply(data[-1:-4], scale)
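One detail that may matter for the k-means step: `scale()` applied to several columns returns a numeric matrix rather than a data frame, and it stores the centering and scaling values as attributes. The sketch below uses a small made-up data frame (the columns `id`, `g1`, `g2`, `g3`, `x` and `y` are hypothetical stand-ins, not the sns data) to show how those attributes can be read and used to undo the scaling.

```r
# Hypothetical toy data standing in for the real data frame:
# four leading columns to skip, then two numeric columns
data <- data.frame(id = 1:3, g1 = "a", g2 = "b", g3 = "c",
                   x = c(1, 2, 3), y = c(10, 20, 60))

m <- scale(data[-1:-4])               # numeric matrix, not a data frame
is_matrix <- is.matrix(m)

centers <- attr(m, "scaled:center")   # column means used for centering
scales  <- attr(m, "scaled:scale")    # column standard deviations used for scaling

# Undo the scaling with the stored attributes: multiply back, then add the means
restored <- sweep(sweep(m, 2, scales, "*"), 2, centers, "+")
```

Keeping `centers` and `scales` around can be handy later, for example to express cluster centers in the original units.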
Dij

You need to assign the result back after applying scale; otherwise the scaled values are computed and then discarded:

for(i in names(df)[-(1:4)]){
   df[, i] <- scale(df[,i], center = TRUE, scale = TRUE)
}

Or with lapply you could do

df[-(1:4)] <- lapply(df[-(1:4)], scale, center = TRUE, scale = TRUE)

And with dplyr we can use mutate_at:

library(dplyr)
df %>%  mutate_at(-(1:4), scale, center = TRUE, scale = TRUE)
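To see why the assignment matters in the loop version, here is a minimal base R sketch on a made-up data frame (`toy` and all of its columns are hypothetical, not the sns data). It runs the loop once without assignment and once with it, then checks the result.

```r
# Hypothetical toy data: four columns to skip, then two numeric columns
toy <- data.frame(
  id = 1:4, a = letters[1:4], b = LETTERS[1:4], c = c(TRUE, FALSE, TRUE, FALSE),
  x = c(0, 0, 2, 6), y = c(10, 20, 30, 40)
)

# Without assignment: scale() returns a new object that the loop throws away
for (i in names(toy)[-(1:4)]) {
  scale(toy[, i], center = TRUE, scale = TRUE)
}
unchanged <- identical(toy$x, c(0, 0, 2, 6))   # toy itself was not modified

# With assignment: store each scaled column back into a copy.
# as.numeric() drops the matrix attributes that scale() attaches.
scaled <- toy
for (i in names(scaled)[-(1:4)]) {
  scaled[[i]] <- as.numeric(scale(scaled[[i]], center = TRUE, scale = TRUE))
}

col_means <- colMeans(scaled[-(1:4)])     # approximately 0 for each column
col_sds   <- sapply(scaled[-(1:4)], sd)   # approximately 1 for each column
```

In this toy example the original zeros in `x` become negative after centering, because the mean of `x` is positive.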
Ronak Shah
  • Found the genius. Yup, that works. That seems like a pretty basic concept. I will keep it in mind. Thank you so much! – Michael May 12 '19 at 02:25

You could use the tidyverse family of packages, which is what I use for pretty much everything I do in R. It's never too early to start using them imo.

library(tidyverse)
# Convert sns.nona to a tibble (robust data format which we can do cool stuff to)
sns.nona <- as_tibble(sns.nona)
# Do cool stuff: mutate_at("columns to change", "function to apply to columns")
sns.nona <- sns.nona %>%
  mutate_at(5:ncol(sns.nona), function(x) scale(x, center = TRUE, scale = TRUE))

NB don't be alarmed by the %>%. Basically, x %>% f(y, z) is equivalent to f(x, y, z).

Captain Hat
  • I have a very basic familiarity with tidyverse. This is very helpful. I know the mutate function is very efficient, so it's good to see examples! – Michael May 12 '19 at 03:05