I'm working on the Boston Housing dataset. I filtered the observations (towns) with the lowest 'medv', transposed them, and saved the result to a new dataframe. I want to add columns to this new dataframe that contain the percentiles, computed against the original data, of the feature values of these filtered observations. Here's the R code:

# load the library containing the dataset
library(MASS)

# save the data with custom name
boston = Boston

# suburb with lowest medv
low.medv = data.frame(t(boston[boston$medv == min(boston$medv),]))
low.medv

# The values I want populated in new columns:

# Finding percentile rank for crim
ecdf(boston$crim)(38.3518)
# >>> 0.9881423
ecdf(boston$crim)(67.9208)
# >>> 0.9960474

# percentile rank for lstat
ecdf(boston$lstat)(30.59)
# >>> 0.9782609
ecdf(boston$lstat)(22.98)
# >>> 0.8992095

Desired output (image not available): the two value columns of low.medv, plus a percentile column for each.

Is there a way to use the ecdf function with sapply?

rahul-ahuja

1 Answer

I think it would be easier if you don't transpose the data beforehand:

low.medv <- boston[boston$medv == min(boston$medv),]
res <- mapply(function(x, y) ecdf(x)(y), boston, low.medv)
res
#       crim     zn  indus   chas    nox      rm age     dis rad
#[1,] 0.9881 0.7352 0.8874 0.9308 0.8577 0.07708   1 0.05731   1
#[2,] 0.9960 0.7352 0.8874 0.9308 0.8577 0.13636   1 0.04150   1
#        tax ptratio  black  lstat     medv
#[1,] 0.9901  0.8893 1.0000 0.9783 0.003953
#[2,] 0.9901  0.8893 0.3498 0.8992 0.003953
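To see what `mapply` is doing here, a small sketch on made-up data may help. The `full` and `picked` data frames below are hypothetical toy stand-ins for `boston` and `low.medv`; the `sapply` line is one possible way to get the same result by looping over column names.

```r
# Toy data (hypothetical, not from the Boston dataset)
full <- data.frame(a = c(1, 2, 3, 4), b = c(40, 10, 30, 20))
picked <- full[c(2, 4), ]  # the "filtered" rows

# mapply walks over the columns of both data frames in parallel:
# x is one column of `full`, y is the same column of `picked`
res <- mapply(function(x, y) ecdf(x)(y), full, picked)
res

# An equivalent form with sapply, iterating over the column names
res2 <- sapply(names(full), function(nm) ecdf(full[[nm]])(picked[[nm]]))
res2
```

Each column of the result holds the percentile ranks of the picked rows for that feature, one row per picked observation.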

Now, if you want the result in four columns as shown, you can do:

cbind(t(low.medv), t(res))
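If labelled columns are preferred, something along these lines might work. It is only a sketch: the `value_` and `pctl_` prefixes are names I made up for illustration, and it assumes the MASS package is installed.

```r
library(MASS)

# Rows of Boston with the lowest medv, kept untransposed
low.medv <- Boston[Boston$medv == min(Boston$medv), ]

# Percentile rank of each filtered value within its original column
res <- mapply(function(x, y) ecdf(x)(y), Boston, low.medv)

# One row per feature: raw values first, then their percentile ranks
out <- data.frame(t(low.medv), t(res))
names(out) <- c(paste0("value_", rownames(low.medv)),
                paste0("pctl_", rownames(low.medv)))
out
```

Using `data.frame()` instead of `cbind()` gives a data frame rather than a matrix, which makes it easier to rename or select columns afterwards.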
Ronak Shah
  • Thanks. It worked. Would you please explain how is mapply proceeding? I'm unfamiliar with that function. – rahul-ahuja Sep 02 '20 at 03:36
  • `mapply`/`Map` works column-wise on both dataframes: in the 1st iteration, the 1st column of `boston` is `x` and the 1st column of `low.medv` is `y`; in the 2nd iteration, it passes the 2nd column of both dataframes to the function, and so on for all the columns. – Ronak Shah Sep 02 '20 at 03:44
  • Also, why do some people prefer to use <- instead of = for assignment? Are there any advantages, or is it just that they've done it that way from the beginning? – rahul-ahuja Sep 02 '20 at 03:44
  • There is not much difference between the two. It is more of a coding style that people have developed over the years. You can check this post for a detailed discussion: https://stackoverflow.com/questions/1741820/what-are-the-differences-between-and-assignment-operators-in-r – Ronak Shah Sep 02 '20 at 03:53
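
One place where the two operators do behave differently is inside a function call. The snippet below is a small illustration with made-up variable names (`x_vals`, `y`), not something taken from the linked post.

```r
x_vals <- c(3, 1, 2)

# Inside a call, `=` names an argument; it does not create a variable `x`
m1 <- median(x = x_vals)

# Inside a call, `<-` assigns `y` in the calling environment
# and then passes the assigned value to the function
m2 <- median(y <- x_vals)

exists("y")  # `y` now exists as a side effect of the call above
```

At the top level, `a = 1` and `a <- 1` both create `a`, which is why the choice there is mostly a matter of style.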