I have a data frame like this:

df <- data.frame(x1=c(1, 2, 3, 2, 1),
                 x2=c(1, 10, 5, 8, 3))

I'm trying to normalize each variable to the range 0 to 1, using that variable's own minimum and maximum. So 2 in x1 would be 0.5 and 5 in x2 would be about 0.44, i.e. (5 - 1) / (10 - 1).

I have tried using the following normalization function:

range01 <- function(x){(x-min(x, na.rm = T))/(max(x, na.rm = T)-min(x, na.rm = T))}
df <- range01(df)

But instead it normalizes all variables by the range of the entire data frame (1 to 10), giving this:

       x1        x2
0.0000000 0.0000000
0.1111111 1.0000000
0.2222222 0.4444444
0.1111111 0.7777778
0.0000000 0.2222222

How can I normalize both columns by their individual range? I need a systematic function to do this, since I am working with many variables across many data frames in a for loop.

Marco Pastor Mayo
    Possible duplicate of [scaling r dataframe to 0-1 with NA values](https://stackoverflow.com/questions/31926022/scaling-r-dataframe-to-0-1-with-na-values) – markus Dec 01 '18 at 20:36

3 Answers

I think you can do it in one line:

sapply(df, function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))

      x1        x2
[1,] 0.0 0.0000000
[2,] 0.5 1.0000000
[3,] 1.0 0.4444444
[4,] 0.5 0.7777778
[5,] 0.0 0.2222222
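
Note that `sapply()` returns a matrix here. The original attempt failed because `min(df)` and `max(df)` are computed over every value in the data frame at once, while `sapply()` or `lapply()` hands the function one column at a time. If the result should stay a data frame, one possible variation (a sketch, not part of the answer above) assigns the output of `lapply()` back into `df[]`. The list `dfs` and its contents are hypothetical, standing in for the many data frames mentioned in the question.

```r
# Column-wise min-max normalization that keeps the data.frame class.
range01 <- function(x) {
  rng <- range(x, na.rm = TRUE)          # c(min, max) of this column only
  (x - rng[1]) / (rng[2] - rng[1])
}

df <- data.frame(x1 = c(1, 2, 3, 2, 1),
                 x2 = c(1, 10, 5, 8, 3))

# Assigning into df_scaled[] replaces the columns but keeps the data frame.
df_scaled <- df
df_scaled[] <- lapply(df, range01)

# Hypothetical list standing in for "many data frames" (assumption).
dfs <- list(a = df, b = df * 2)
dfs_scaled <- lapply(dfs, function(d) {
  d[] <- lapply(d, range01)
  d
})
```

Because the helper takes a single vector, the same function can be reused inside a `for` loop or an outer `lapply()` without changes.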
YOLO

With base R:

apply(df, 2, function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))

      x1        x2
[1,] 0.0 0.0000000
[2,] 0.5 1.0000000
[3,] 1.0 0.4444444
[4,] 0.5 0.7777778
[5,] 0.0 0.2222222
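
One caveat worth illustrating: `apply()` converts the data frame to a matrix first, so a single character column turns every value into a string and the arithmetic fails. A minimal base R sketch that rescales only the numeric columns might look like this; `df_mixed` and its `id` column are hypothetical examples, not data from the question.

```r
range01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Hypothetical data frame with one non-numeric column (assumption).
df_mixed <- data.frame(id = c("a", "b", "c"),
                       x1 = c(1, 2, 3),
                       x2 = c(10, NA, 30))

# Find the numeric columns, then overwrite only those.
is_num <- vapply(df_mixed, is.numeric, logical(1))
df_mixed[is_num] <- lapply(df_mixed[is_num], range01)
```

The `NA` in `x2` stays `NA`, and the `id` column is left untouched.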

Or with dplyr:

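library(dplyr)

# Apply the function to the variables whose names start with "x".
# funs() is deprecated since dplyr 0.8.0, so a formula lambda is used instead.
df %>%
  mutate_at(vars(starts_with("x")),
            ~ (. - min(., na.rm = TRUE)) / (max(., na.rm = TRUE) - min(., na.rm = TRUE)))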

   x1        x2
1 0.0 0.0000000
2 0.5 1.0000000
3 1.0 0.4444444
4 0.5 0.7777778
5 0.0 0.2222222

Or a different dplyr solution, applying the function to all columns:

df %>%
  mutate_all(~ (. - min(., na.rm = TRUE)) / (max(., na.rm = TRUE) - min(., na.rm = TRUE)))
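
In dplyr 1.0.0 and later, `across()` supersedes `mutate_at()` and `mutate_all()`. A sketch of the same idea in that style, assuming a recent dplyr is installed, could be written as follows; selecting with `where(is.numeric)` is a choice made here for illustration, and `starts_with("x")` would work as well.

```r
library(dplyr)

df <- data.frame(x1 = c(1, 2, 3, 2, 1),
                 x2 = c(1, 10, 5, 8, 3))

# Rescale every numeric column by its own minimum and maximum.
df_scaled <- df %>%
  mutate(across(where(is.numeric),
                ~ (.x - min(.x, na.rm = TRUE)) /
                  (max(.x, na.rm = TRUE) - min(.x, na.rm = TRUE))))
```

The result is still a data frame, which makes it easy to drop into a longer pipeline.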

Or with data.table:

library(data.table)
setDT(df)[, lapply(.SD, function(x) (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE)))]

    x1        x2
1: 0.0 0.0000000
2: 0.5 1.0000000
3: 1.0 0.4444444
4: 0.5 0.7777778
5: 0.0 0.2222222
tmfmnk

Another option, based on the scales package:

library("scales")
df <- data.frame(x1=c(1, 2, 3, 2, 1),
                 x2=c(1, 10, 5, 8, 3))
sapply(df, rescale)

The default is the 0-1 range, but you can also pass other ranges (e.g. 0-100) through the to argument:

sapply(df, rescale, to = c(0, 100))
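
To make the effect of a target range concrete, here is a base R sketch of the linear map involved. `rescale_to` is a hypothetical helper written for illustration, not a function from scales, and it is only assumed to match `rescale()` for finite inputs whose range is not zero.

```r
# Hypothetical helper (not part of scales): map x linearly onto `to`.
rescale_to <- function(x, to = c(0, 1)) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1]) * (to[2] - to[1]) + to[1]
}

df <- data.frame(x1 = c(1, 2, 3, 2, 1),
                 x2 = c(1, 10, 5, 8, 3))

# Each column is scaled to 0-100 using its own minimum and maximum.
scaled_100 <- sapply(df, rescale_to, to = c(0, 100))
```

As with the other `sapply()` calls, the result is a matrix with one column per variable.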
paoloeusebi