2

I'm having trouble with something that must be quite easy in R; I want to fill the missing values in a column (of a data.frame) with the corresponding values. So like this:

V1    V2  
cat   tree            
cat   NA    
NA    tree  
dog   house      
NA    house    
dog   NA   
horse NA  
NA    car  
horse car

So the corresponding string of cat is tree, so "tree" must be filled in when there is a NA in the "cat group". "house" must be filled in when there is a NA in the "dog group" (so I must choose to take the first word of the list at 1 and 2 as the "leading" word to fill in at every number - EDIT --> it is better when the first is not leading in case of a NA is first).

There are a lot of NA's in V1, and a few in V2, and I want to fill only the NA's of V2.

In SPSS its done with the aggregate function, but I dont think the aggregate function in R is comparable in this case, or is it? Anyone knows how to do this?

Thanks!

Uwe
  • 41,420
  • 11
  • 90
  • 134
W. Mooi
  • 119
  • 1
  • 3
  • 11
  • is there always exactly one distinct non-NA value in the second column for each unique value in the first column? – talat Apr 20 '17 at 13:57
  • Yes, so in this example, "tree" always corresponds with 1 – W. Mooi Apr 20 '17 at 14:02
  • Possible duplicate of [How do I replace NA values with zeros in an R dataframe?](http://stackoverflow.com/questions/8161836/how-do-i-replace-na-values-with-zeros-in-an-r-dataframe) – Single Entity Apr 20 '17 at 14:49
  • 2
    Try this: `mydata$V3 <- ave(mydata$V2, mydata$V1, FUN = function(x) na.omit(x)[1])` – talat Apr 20 '17 at 15:01
  • 2
    @SingleEntity, that link is not a duplicate since in this post they want different replacements per group – talat Apr 20 '17 at 15:02
  • @Docendo discimus, this works! Thank you very much!! – W. Mooi Apr 21 '17 at 09:03
  • This is quite a substantial edit to your initial data set invalidating most of the answers :-(. – Uwe Apr 21 '17 at 10:08
  • What about the first row with `V1`== "horse" should `V2` be filled in by "car" or should it stay NA? – Uwe Apr 21 '17 at 10:14
  • @ Uwe Block - Yeah, I also realise that now I've tried your code, sorry. Thanks a lot for your help anyway! – W. Mooi Apr 21 '17 at 10:15
  • @UweBlock - Yes, all the NA's of V2 should be filled (except when there are only NA's on both V1 and V2) – W. Mooi Apr 21 '17 at 10:17

3 Answers3

4

The OP has requested that the missing values need to be filled in by group. So, the zoo::na.locf() approach might fail here.

There is a method called update join which can be used to fill in the missing values per group:

library(data.table)   # version 1.10.4 used
setDT(DT)
DT[DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1], on = "V1", V2 := fillin][]
#    V1    V2
# 1:  1  tree
# 2:  1  tree
# 3:  1  tree
# 4:  2 house
# 5:  2 house
# 6:  2 house
# 7:  3  lawn
# 8:  3  lawn
# 9:  4    NA
#10:  4    NA
#11: NA    NA
#12: NA  tree

Note that the input data have been supplemented to cover some corner cases.

Explanation

The approach consists of two steps. First, the values to be filled in by group are determined followed by the update join which modifies DT in place.

fill_by_group <- DT[!is.na(V1)][order(V2), .(fillin = first(V2)), by = V1]
fill_by_group
#   V1 fillin
#1:  2  house
#2:  3   lawn
#3:  1   tree
#4:  4     NA

DT[fill_by_group, on = "V1", V2 := fillin][]

order(V2) ensures that any NA values are sorted last, so that first(V2) picks the correct value to fill in.

The update join approach has been benchmarked as the fastest method in another case.

Variant using na.omit()

docendo discimus has suggested in his comment to use na.omit(). This can be utilized for the update join as well replacing order()/first():

DT[DT[!is.na(V1), .(fillin = na.omit(V2)), by = V1], on = "V1", V2 := fillin][]

Note that na.omit(V2) works as well as na.omit(V2)[1] or first(na.omit(V2)), here.

Data

Edit: The OP has changed his originally posted data set substantially. As a quick fix, I've updated the sample data below to include cases where V1 is NA.

library(data.table)
DT <- fread(
"1 tree
1 NA
1 tree
2 house
2 house
2 NA
3 NA
3 lawn
4 NA
4 NA
NA NA
NA tree")

Note that the data given by the OP have been supplemented to cover three additional cases:

  • The first V2 value in each group is NA.
  • All V2 values in a group are NA.
  • V1 is `NA.
Community
  • 1
  • 1
Uwe
  • 41,420
  • 11
  • 90
  • 134
  • Is `order` necessary? Is it to ensure that NA go to the bottom for the fillin to work? – Sotos Apr 21 '17 at 06:36
  • @Sotos Yes, `order` is necessary if `first` is used. I have added a variant using `na.omit`. – Uwe Apr 21 '17 at 06:57
  • ahh ok. So you are basically grabbing `v2[1]` so `order` makes sure it is not an NA. So `fillin` doesn't work the same as `na.locf` or `tidyr::fill`? – Sotos Apr 21 '17 at 07:07
  • @Sotos The OP has requested to fill in missing values _by group_ (with the implied assumption that each group conists only of one distinct value or NA) – Uwe Apr 21 '17 at 07:09
  • For some reason, this doesn't fully work (maybe I am doing it wrong?). This code treats NA's as 1 group, so every NA is filled with "tree" for example... the extra code with na.omit(V2)) also doesn't work (error: "Error in vecseq(f__, len__, if (allow.cartesian || notjoin || !anyDuplicated(f__, : Join results in more than 2^31 rows (internal vecseq reached physical limit). Very likely misspecified join etc etc). The answer of docendo discimus works perfect: mydata$V3 <- ave(mydata$V2, mydata$V1, FUN = function(x) na.omit(x)[1]) – W. Mooi Apr 21 '17 at 09:06
  • @W.Mooi That's strange. Potentials problems or ideas how to track down the issue: Do the data supplied in my answer work for you? Which version of `data.table` do you have installed? What does `str()` return for your data? – Uwe Apr 21 '17 at 09:21
  • @ Uwe Block - yes, the data of your example works. I installed version 1.10.4 of data.table. The v1 and v2 in my data are both characters, so maybe that is the cause of the problem? I also only want to fill the NA's in v2 - in this code, it also fills the non-NA v2 cells when the v1 cell is also NA. So maybe my example was not really helpful; I'm gonna edit my OP – W. Mooi Apr 21 '17 at 09:46
  • @W.Mooi If there are NAs in V1 that might change the game. – Uwe Apr 21 '17 at 09:57
  • @W.Mooi I've quickly update my answer to cover cases where `V1` is `NA`. The solution should work also with your new data set. – Uwe Apr 21 '17 at 10:31
  • @Uwe, why not just use `df[!is.na(V1), V2 := first(na.omit(V2)), by = V1]` instead of a join? – talat Apr 21 '17 at 10:44
3

you can use dplyr and try:

mydata %>% 
  group_by(V1) %>%
  mutate(V2 = unique(V2[!is.na(V2)]))
Aramis7d
  • 2,444
  • 19
  • 25
-1

You can use below:

mydata<-read.table(text="1 tree
1 NA
1 tree
2 house
2 house
2 NA")

mydata[is.na(mydata$V2),]$V2<-mydata[which(is.na(mydata$V2))-1,]$V2
Erdem Akkas
  • 2,062
  • 10
  • 15
  • 1
    This approach works only for the given data set. It fails when the data include groups which start with `NA`. – Uwe Apr 21 '17 at 06:35
  • without a loop, this also fails for data with consecutive more than one `NA` within a group – Aramis7d Apr 21 '17 at 07:21