This is the code I am using in R to compute the standard deviation of each numeric column in my dataset. However, the for loop finishes without printing anything. What is wrong with my code? I know for sure that the dataset contains numeric columns.

for(col in colnames(stats)){
  if(is.numeric(stats[, col])){
    cat(paste(col, "sd is ", as.character(round(sd(stats[, col]), 2)), '\n'))
  }
}

Structure of my data frame (stats):

> str(stats)
tibble [3,145 x 11] (S3: tbl_df/tbl/data.frame)
 $ Name                    : chr [1:3145] "A. Urzi" "V. Castellanos" "E. Palacios" "L. Martínez" ...
 $ Age                     : num [1:3145] 19 20 20 21 21 21 21 21 21 21 ...
 $ Nationality             : chr [1:3145] "Argentina" "Argentina" "Argentina" "Argentina" ...
 $ Club                    : chr [1:3145] "Club Athletico Banfield" "New York City FC" "River Plate" "Ajax" ...
 $ Overall                 : num [1:3145] 69 63 77 77 68 73 81 66 66 78 ...
 $ Potential               : num [1:3145] 87 80 87 85 81 87 89 76 79 87 ...
 $ International Reputation: num [1:3145] 1 1 1 1 1 1 1 1 1 1 ...
 $ Skill Moves             : num [1:3145] 3 3 4 3 3 2 4 2 2 4 ...
 $ Team Position           : chr [1:3145] "Attacker" "Attacker" "Midfielder" "Defender" ...
 $ Contract Valid Until    : num [1:3145] 2021 2022 2021 2023 2019 ...
 $ Value in Euros          : num [1:3145] 2.3e+06 8.0e+05 1.4e+07 1.2e+07 1.7e+06 8.0e+06 2.7e+07 9.5e+05 1.2e+06 1.6e+07 ...

> dput(head(stats))
structure(list(Name = c("A. Urzi", "V. Castellanos", "E. Palacios", 
"L. Martínez", "F. Moyano", "C. Romero"), Age = c(19, 20, 20, 
21, 21, 21), Nationality = c("Argentina", "Argentina", "Argentina", 
"Argentina", "Argentina", "Argentina"), Club = c("Club Athletico Banfield", 
"New York City FC", "River Plate", "Ajax", "Argentinos Juniors", 
"Genoa"), Overall = c(69, 63, 77, 77, 68, 73), Potential = c(87, 
80, 87, 85, 81, 87), `International Reputation` = c(1, 1, 1, 
1, 1, 1), `Skill Moves` = c(3, 3, 4, 3, 3, 2), `Team Position` = c("Attacker", 
"Attacker", "Midfielder", "Defender", "Midfielder", "Defender"
), `Contract Valid Until` = c(2021, 2022, 2021, 2023, 2019, 2024
), `Value in Euros` = c(2300000, 8e+05, 1.4e+07, 1.2e+07, 1700000, 
8e+06)), row.names = c(NA, -6L), class = c("tbl_df", "tbl", "data.frame"
))
Ayushmaan
  • Show `str(stats)`. Make sure `stats` is a data frame or matrix. Make sure it has at least one numeric column. – Gregor Thomas May 19 '20 at 14:55
  • Also, `paste` will convert its inputs to `character`, so your `as.character()` isn't needed. – Gregor Thomas May 19 '20 at 14:55
  • Can you make this a little more reproducible by providing a sample of records for `stats`? See [How to make a great R reproducible example](https://stackoverflow.com/q/5963269/2572423) – JasonAizkalns May 19 '20 at 14:56
  • `stats <- iris` followed by your code works as expected, I'm voting to close as simple typo/not reproducible. – Rui Barradas May 19 '20 at 15:04
  • Wrapping `stats` with `as.data.frame`, as in `as.data.frame(stats)`, will fix the issue, but I'm currently trying to understand why. Something to do with `stats` being a `tibble` object. If you're using `dplyr`, it's probably easier to just write: `summarise_if(stats, is.numeric, sd)` – JasonAizkalns May 19 '20 at 15:20

1 Answer

The driver of the confusion here is that stats is a tibble:

class(stats)
[1] "tbl_df"     "tbl"        "data.frame"

When you subset a tibble (i.e., a tbl_df object) with [, the result is another tibble, even if you select a single column. Consider the following with one of the numeric columns, Overall:

class(stats[, "Overall"])
[1] "tbl_df"     "tbl"        "data.frame"

This is different from a data.frame:

class(as.data.frame(stats)[, "Overall"])
[1] "numeric"

This is because the default behavior for subsetting a data.frame in base R is to simplify any results that return a single column to a vector. We can avoid this behavior with drop = FALSE:

class(as.data.frame(stats)[, "Overall", drop = FALSE])
[1] "data.frame"

Likewise, and perhaps unexpectedly:

is.numeric(stats[, "Overall"])
[1] FALSE
is.numeric(as.data.frame(stats)[, "Overall"])
[1] TRUE
is.numeric(as.data.frame(stats)[, "Overall", drop = FALSE])
[1] FALSE
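The reason for the FALSE results can be sketched without the tibble package: a one-column data frame is still a list of columns, and is.numeric() returns FALSE for a list. The example below is a minimal sketch that uses a toy base data.frame (df and one_col are hypothetical names, and the values are illustrative) with drop = FALSE to mimic what a tibble returns for [, col].

```r
# Toy data frame; names and values are illustrative, not the asker's data
df <- data.frame(Name = c("a", "b", "c"), Overall = c(69, 63, 77))

# drop = FALSE keeps the result a data frame, which mimics stats[, col] on a tibble
one_col <- df[, "Overall", drop = FALSE]

is.list(one_col)          # a data frame is a list of columns
is.numeric(one_col)       # FALSE for a list, so the `if` body in the loop never runs
is.numeric(one_col[[1]])  # the column vector itself is numeric
```

This is why the original loop ends silently: the condition is never TRUE, so cat() is never reached.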

And for good measure, though it may add to the confusion, check out what happens when you subset with double brackets, [[:

class(stats[["Overall"]])
[1] "numeric"
is.numeric(stats[["Overall"]])
[1] TRUE

So if you want to use your code "as-is", you could convert the tbl_df back to a plain-vanilla data.frame in the appropriate spots:

for(col in colnames(stats)) {
  if(is.numeric(as.data.frame(stats)[, col])) {
    cat(paste(col, "sd is", round(sd(as.data.frame(stats)[, col]), 2), '\n'))
  }
}

Alternatively, you could use [[:

for(col in colnames(stats)) {
  if(is.numeric(stats[[col]])) {
    cat(paste(col, "sd is", round(sd(stats[[col]]), 2), '\n'))
  }
}
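If you prefer to avoid the explicit loop, a base R sketch along the same lines is shown below. It relies on the fact that a data frame (or tibble) is a list of columns, so vapply() visits each column as a vector. The names stats_demo, is_num, and sds are hypothetical; the values are the first three rows of Name, Age, and Overall from the dput() output in the question.

```r
# Small stand-in for `stats`, built from the first rows of the question's dput() output
stats_demo <- data.frame(
  Name    = c("A. Urzi", "V. Castellanos", "E. Palacios"),
  Age     = c(19, 20, 20),
  Overall = c(69, 63, 77)
)

# vapply() iterates over columns as vectors, so is.numeric() sees the column itself
is_num <- vapply(stats_demo, is.numeric, logical(1))

# Standard deviation of each numeric column, returned as a named numeric vector
sds <- vapply(stats_demo[is_num], sd, numeric(1))

# One line per column, in the same format as the loop above
cat(paste(names(sds), "sd is", round(sds, 2)), sep = "\n")
```

Selecting columns with a logical vector in single brackets, as in stats_demo[is_num], behaves the same way for a plain data.frame and a tibble, so this form should not depend on which one you have.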

Finally, since the data is a tibble, I assume you are using the tidyverse; a more tidyverse-flavored approach would be:

library(dplyr)
library(glue)

stats %>%
  summarise_if(is.numeric, sd) %>% 
  glue_data("{colnames(.)} sd is {round(., 2)}")

Age sd is 0.82
Overall sd is 5.53
Potential sd is 3.21
International Reputation sd is 0
Skill Moves sd is 0.63
Contract Valid Until sd is 1.75
Value in Euros sd is 5690577.01
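As a side note, in dplyr 1.0.0 and later the scoped verbs such as summarise_if() are superseded by across(). A minimal sketch of the equivalent summary step, assuming a recent dplyr is installed, might look like this (stats_demo and sds are hypothetical names; the values are the first three rows from the question's dput() output):

```r
library(dplyr)

# Small stand-in tibble; tibble() is re-exported by dplyr
stats_demo <- tibble(
  Name    = c("A. Urzi", "V. Castellanos", "E. Palacios"),
  Age     = c(19, 20, 20),
  Overall = c(69, 63, 77)
)

# across(where(is.numeric), sd) applies sd() to every numeric column
sds <- stats_demo %>%
  summarise(across(where(is.numeric), sd))

sds
```

The result is a one-row tibble with one column per numeric variable, which can be passed to glue_data() in the same way as above.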

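The lesson here is to get into the habit of using drop = FALSE whenever you subset a data.frame with [ , ], so that the result is always a data frame regardless of how many columns you select, and to use [[ when you want the column itself as a vector. Here's a nice blog post with more details and explanation as to why.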

JasonAizkalns