2

I'm a beginner in R and I would like to know how to do the following task:

I want to replace the missing values of my dataset by the median for all the columns of my dataset. However, for each column, I want the median of a certain category (depending on another column).My dataset is as follows

structure(list(Country = structure(1:5, .Label = c("Afghanistan", 
"Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
    CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L), Adolescent.fertility.rate.... = c(151L, 
    27L, 6L, NA, 146L), Adult.literacy.rate.... = c(28, 98.7, 
    69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, 
-5L))

So for each of the columns, I want to replace the missing values by the median of the values in the specific continent.

Martijn Pieters
  • 1,048,767
  • 296
  • 4,058
  • 3,343
AYLA
  • 41
  • 5
  • 3
    Hi Ayla and welcome to Stack Overflow. If you post pictures of your data instead of just copying and pasting the text into your question, then anyone who wants to help you would have to sit and type out all your data again. Please help us to help you by editing your question to include the data in text format. – Allan Cameron Mar 06 '20 at 13:18
  • 1
    Images are a really bad way of posting data (or code). Can you post sample data in `dput` format? Please edit **the question** with the output of `dput(df)`. Or, if it is too big with the output of `dput(head(df, 20))`. (`df` is the name of your dataset.) – Rui Barradas Mar 06 '20 at 13:20
  • 1
    Welcome to Stack Overflow. Help us to help you. Here is how to provide data that others can work with. https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example Also, how would the expected outcome would look like? – Janhoo Mar 06 '20 at 13:21
  • Thank you guys, for the note also. Is it now ok? @AllanCameron – AYLA Mar 06 '20 at 13:44

2 Answers2

1

Here is a solution using the library dplyr. I called your dataframe ww and renamed your column:

library('dplyr')
ww %>% 
  rename(
    lit_rate = Adult.literacy.rate....
  ) %>% 
  group_by(
    Continent
  ) %>% 
  mutate(
    lit_rate = replace(
      lit_rate,
      is.na(lit_rate),
      median(lit_rate, na.rm = TRUE)
    )
  ) %>% 
  ungroup()
svenhalvorson
  • 1,090
  • 8
  • 21
1

We can use dplyr::mutate_at to replace NAs in each column (except Continent and the non numeric column Country) with the median for its Continent group

df <- structure(list(Country = structure(1:5, .Label = c("Afghanistan",  "Albania", "Algeria", "Andorra", "Angola"), class = "factor"), 
               CountryID = 1:5, Continent = c(1L, 2L, 3L, 2L, 3L),
               Adolescent.fertility.rate.... = c(151L, 27L, 6L, NA, 146L),
               Adult.literacy.rate.... = c(28, 98.7, 69.9, NA, 67.4)), class = "data.frame", row.names = c(NA, -5L))

library(dplyr)
df %>%
  group_by(Continent) %>% 
  mutate_at(vars(-group_cols(), -Country), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>% 
  ungroup()

Returns:

  # A tibble: 5 x 5
    Country     CountryID Continent Adolescent.fertility.rate.... Adult.literacy.rate....
    <fct>           <int>     <int>                         <int>                   <dbl>
  1 Afghanistan         1         1                           151                    28  
  2 Albania             2         2                            27                    98.7
  3 Algeria             3         3                             6                    69.9
  4 Andorra             4         2                            27                    98.7
  5 Angola              5         3                           146                    67.4

Explanation: First we group the data.frame df by Continent. Then we mutate all columns except the grouping column (and Country which is not numeric) the following way: If is.na is TRUE, we replace it with the median, and since we are grouped, it's going to be the median for the Continent group (if its not NA we replace it with itself). Finally we call ungroup for good measure to get back a 'normal' tibble.

dario
  • 6,415
  • 2
  • 12
  • 26
  • How can I save the resulting output? Thank you a lot for your help! – AYLA Mar 06 '20 at 15:27
  • 1
    We can assign the expression to a variable i.e. `new_df <- df %>% group_by(Continent) %>% mutate_at(vars(-group_cols(), -Country), ~ifelse(is.na(.), median(., na.rm = TRUE), .)) %>% ungroup()` – dario Mar 06 '20 at 15:29