Filling missing levels

Question

I have the following type of dataframe:

Country <- rep(c("USA", "AUS", "GRC"),2)
Year    <- 2001:2006
Level   <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)

df 
Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   <NA>
5     AUS 2005   <NA>
6     GRC 2006   <NA>

I want to fill the missing values with the correct level label in the last from the right column.

So the expected outcome should be like this:

Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   rich
5     AUS 2005 middle
6     GRC 2006   poor

Related: [*Replacing NAs with latest non-NA value*](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) — Jaap, Dec 21 '17 at 21:18
Also related: [_Replace missing values (NA) with most recent non-NA by group_](https://stackoverflow.com/q/23340150/4497050) — alistaire, Dec 22 '17 at 21:03

Rich Scriven · Answer 1 · 2017-12-21T21:09:02.987

In base R, you could use ave():

transform(df, Level = ave(Level, Country, FUN = na.omit))

#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor

Another, more accurate possibility is to use a join. Here we merge the Country column with the NA-omitted data. The outcome is the same, just in a different row order.

merge(df["Country"], na.omit(df))

#   Country Year  Level
# 1     AUS 2002 middle
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     GRC 2003   poor
# 5     USA 2001   rich
# 6     USA 2001   rich

I get this error message "Error in x[i] <- value[[j]] : replacement has length zero" when I apply it to my own dataframe — msh855, Dec 24 '17 at 10:29

akrun · Accepted Answer · 2017-12-24T08:11:40.813

9

We can group by 'Country' and get the non-NA unique value

library(dplyr)
df %>%
    group_by(Country) %>% 
    dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups:   Country [3]
#  Country  Year  Level
#   <fctr> <int> <fctr>
#1     USA  2001   rich
#2     AUS  2002 middle
#3     GRC  2003   poor
#4     USA  2004   rich
#5     AUS  2005 middle
#6     GRC  2006   poor

If we have loaded dplyr along with plyr, it is better to specify explicitly dplyr::mutate or dplyr::summarise so that it uses the function from dplyr. There are same functions in plyr and it could potentially mask the functions from dplyr when both are loaded creating different behavior.

edited Dec 24 '17 at 08:11

answered Dec 21 '17 at 18:39

akrun

874,273
37
540
662

3

Nice thought. Good job. But aren't we not making an assumption here there will always be a non na unique value? Just trying to understand.. – Abdul Rahman K Dec 21 '17 at 18:57
@AbdulRahmanK Thanks, It seems like the OP have a single unique value. If there are multiple, values, perhaps `mutate(Level = toString(unique(Level[!is.na(Level)])))` – akrun Dec 21 '17 at 19:39
@akrun , I think your answer did not work. Even for the MWE I provided the outcome was not the expected one. because of this [1] and maps all countries to "rich". Tried to find a solution, but nothing worked so far. – msh855 Dec 24 '17 at 00:02
@msh855 It is not clear. Based on the example, it is working – akrun Dec 24 '17 at 05:25
@akrun I don't know ! but when I use my example with your suggested code, the third column 'Level' is filled with only the value/level 'rich'. So, it is like replacing all values with this. – msh855 Dec 24 '17 at 07:41
@msh855 As I mentioned, your example is working fine with this. Is it possible that you also loaded `plyr`. Try with `df %>% group_by(Country) %>% dplyr::mutate(Level = Level[!is.na(Level)][1])` – akrun Dec 24 '17 at 07:42
@akrun, ok you are right. once dplyr is updated it worked. But, let me trying now to my own dataframe. You have any idea why by loading dplyr it worked ? – msh855 Dec 24 '17 at 07:46
1

@msh855 This is one of the problems of functions masking others. the package `plyr` have `summarise/mutate` etc which could mask this dplyr `summarise/mutate` and when you call this, it could be calling the `plyr` mutate instead. It is better to use `dplyr::` to specify that you wanted the function from that package. – akrun Dec 24 '17 at 07:47
@akrun finaly it worked. I re-accepted your answer. but, maybe you wish to update it making this point about dplyr more clear to everyone. Thank you for your help. – msh855 Dec 24 '17 at 08:09

score 8 · Answer 3 · answered Dec 21 '17 at 18:39

You can do it using data.table and zoo:-

library(data.table)
library(zoo)
setDT(df)
df[, Level := na.locf(Level), by = Country]

This will give you:-

   Country Year  Level
1:     USA 2001   rich
2:     AUS 2002 middle
3:     GRC 2003   poor
4:     USA 2004   rich
5:     AUS 2005 middle
6:     GRC 2006   poor

eipi10 · Answer 4 · 2017-12-21T18:45:41.947

library(dplyr)

df %>% 
  group_by(Country) %>% 
  mutate(Level = replace(Level, is.na(Level), unique(na.omit(Level))))

  Country  Year  Level
   <fctr> <int> <fctr>
1     USA  2001   rich
2     AUS  2002 middle
3     GRC  2003   poor
4     USA  2004   rich
5     AUS  2005 middle
6     GRC  2006   poor

Or, more succinctly, applying @suchait's idea to use na.locf:

df %>% 
  group_by(Country) %>% 
  mutate(Level = zoo::na.locf(Level))

score 1 · Answer 5 · answered Dec 21 '17 at 20:27

1

A solution using dplyr and tidyr.

library(dplyr)
library(tidyr)

df %>%
  arrange(Country) %>%
  fill(Level) %>%
  arrange(Year)
#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor

answered Dec 21 '17 at 20:27

www

38,575
12
48
84

strrep(letters[23], 3) – AidanGawronski Dec 21 '17 at 22:58
@AidanGawronski Cool! Thanks for sharing. – www Dec 21 '17 at 23:30

score 1 · Answer 6 · answered Dec 22 '17 at 08:04

Here is another data.table solution which updates on join using a lookup table which is created from the given dataset itself:

library(data.table)
setDT(df)[df[!is.na(Level)], on = .(Country), Level := Level][]

   Country Year  Level
1:     USA 2001   rich
2:     AUS 2002 middle
3:     GRC 2003   poor
4:     USA 2004   rich
5:     AUS 2005 middle
6:     GRC 2006   poor

Filling missing levels

6 Answers6