12

I have the following type of dataframe:

Country <- rep(c("USA", "AUS", "GRC"),2)
Year    <- 2001:2006
Level   <- c("rich","middle","poor",rep(NA,3))
df <- data.frame(Country, Year,Level)

df 
Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   <NA>
5     AUS 2005   <NA>
6     GRC 2006   <NA>

I want to fill the missing values with the correct level label in the last from the right column.

So the expected outcome should be like this:

Country Year  Level
1     USA 2001   rich
2     AUS 2002 middle
3     GRC 2003   poor
4     USA 2004   rich
5     AUS 2005 middle
6     GRC 2006   poor
Jaap
  • 81,064
  • 34
  • 182
  • 193
msh855
  • 1,493
  • 1
  • 15
  • 36
  • Related: [*Replacing NAs with latest non-NA value*](https://stackoverflow.com/questions/7735647/replacing-nas-with-latest-non-na-value) – Jaap Dec 21 '17 at 21:18
  • Also related: [_Replace missing values (NA) with most recent non-NA by group_](https://stackoverflow.com/q/23340150/4497050) – alistaire Dec 22 '17 at 21:03

6 Answers6

13

In base R, you could use ave():

transform(df, Level = ave(Level, Country, FUN = na.omit))

#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor

Another, more accurate possibility is to use a join. Here we merge the Country column with the NA-omitted data. The outcome is the same, just in a different row order.

merge(df["Country"], na.omit(df))

#   Country Year  Level
# 1     AUS 2002 middle
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     GRC 2003   poor
# 5     USA 2001   rich
# 6     USA 2001   rich
Rich Scriven
  • 97,041
  • 11
  • 181
  • 245
  • I get this error message "Error in x[i] <- value[[j]] : replacement has length zero" when I apply it to my own dataframe – msh855 Dec 24 '17 at 10:29
9

We can group by 'Country' and get the non-NA unique value

library(dplyr)
df %>%
    group_by(Country) %>% 
    dplyr::mutate(Level = Level[!is.na(Level)][1])
# A tibble: 6 x 3
# Groups:   Country [3]
#  Country  Year  Level
#   <fctr> <int> <fctr>
#1     USA  2001   rich
#2     AUS  2002 middle
#3     GRC  2003   poor
#4     USA  2004   rich
#5     AUS  2005 middle
#6     GRC  2006   poor

If we have loaded dplyr along with plyr, it is better to specify explicitly dplyr::mutate or dplyr::summarise so that it uses the function from dplyr. There are same functions in plyr and it could potentially mask the functions from dplyr when both are loaded creating different behavior.

akrun
  • 874,273
  • 37
  • 540
  • 662
  • 3
    Nice thought. Good job. But aren't we not making an assumption here there will always be a non na unique value? Just trying to understand.. – Abdul Rahman K Dec 21 '17 at 18:57
  • @AbdulRahmanK Thanks, It seems like the OP have a single unique value. If there are multiple, values, perhaps `mutate(Level = toString(unique(Level[!is.na(Level)])))` – akrun Dec 21 '17 at 19:39
  • @akrun , I think your answer did not work. Even for the MWE I provided the outcome was not the expected one. because of this [1] and maps all countries to "rich". Tried to find a solution, but nothing worked so far. – msh855 Dec 24 '17 at 00:02
  • @msh855 It is not clear. Based on the example, it is working – akrun Dec 24 '17 at 05:25
  • @akrun I don't know ! but when I use my example with your suggested code, the third column 'Level' is filled with only the value/level 'rich'. So, it is like replacing all values with this. – msh855 Dec 24 '17 at 07:41
  • @msh855 As I mentioned, your example is working fine with this. Is it possible that you also loaded `plyr`. Try with `df %>% group_by(Country) %>% dplyr::mutate(Level = Level[!is.na(Level)][1])` – akrun Dec 24 '17 at 07:42
  • @akrun, ok you are right. once dplyr is updated it worked. But, let me trying now to my own dataframe. You have any idea why by loading dplyr it worked ? – msh855 Dec 24 '17 at 07:46
  • 1
    @msh855 This is one of the problems of functions masking others. the package `plyr` have `summarise/mutate` etc which could mask this dplyr `summarise/mutate` and when you call this, it could be calling the `plyr` mutate instead. It is better to use `dplyr::` to specify that you wanted the function from that package. – akrun Dec 24 '17 at 07:47
  • @akrun finaly it worked. I re-accepted your answer. but, maybe you wish to update it making this point about dplyr more clear to everyone. Thank you for your help. – msh855 Dec 24 '17 at 08:09
8

You can do it using data.table and zoo:-

library(data.table)
library(zoo)
setDT(df)
df[, Level := na.locf(Level), by = Country]

This will give you:-

   Country Year  Level
1:     USA 2001   rich
2:     AUS 2002 middle
3:     GRC 2003   poor
4:     USA 2004   rich
5:     AUS 2005 middle
6:     GRC 2006   poor
sm925
  • 2,648
  • 1
  • 16
  • 28
5
library(dplyr)

df %>% 
  group_by(Country) %>% 
  mutate(Level = replace(Level, is.na(Level), unique(na.omit(Level))))
  Country  Year  Level
   <fctr> <int> <fctr>
1     USA  2001   rich
2     AUS  2002 middle
3     GRC  2003   poor
4     USA  2004   rich
5     AUS  2005 middle
6     GRC  2006   poor

Or, more succinctly, applying @suchait's idea to use na.locf:

df %>% 
  group_by(Country) %>% 
  mutate(Level = zoo::na.locf(Level))
eipi10
  • 91,525
  • 24
  • 209
  • 285
1

A solution using and .

library(dplyr)
library(tidyr)

df %>%
  arrange(Country) %>%
  fill(Level) %>%
  arrange(Year)
#   Country Year  Level
# 1     USA 2001   rich
# 2     AUS 2002 middle
# 3     GRC 2003   poor
# 4     USA 2004   rich
# 5     AUS 2005 middle
# 6     GRC 2006   poor
www
  • 38,575
  • 12
  • 48
  • 84
1

Here is another data.table solution which updates on join using a lookup table which is created from the given dataset itself:

library(data.table)
setDT(df)[df[!is.na(Level)], on = .(Country), Level := Level][]
   Country Year  Level
1:     USA 2001   rich
2:     AUS 2002 middle
3:     GRC 2003   poor
4:     USA 2004   rich
5:     AUS 2005 middle
6:     GRC 2006   poor
Uwe
  • 41,420
  • 11
  • 90
  • 134