2

I have two dataframes both of which contain the same variable and a unique id for each observation.

df.1 is a large dataset which contains missing values denoted by an NA. The values for those missing entries are contained within df.2 and I would like to replace the missings from df.1 with the values from df.2 by matching the id's.

I haven't been able to find a similar issue on here which takes into account them both being factor variables.

To make things simpler: if the id's match - the missing value from df.1 should be replaced with the factor value from df.2.

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"))

df.2 <- data.frame(id = c(440,2304), 
                v2 = c("4 dogs","5 dogs"))

Your help is much appreciated.

dayleymart
  • 23
  • 8
  • Did you check these, I think this has been answered. [here](https://stackoverflow.com/questions/5965698/merge-unequal-dataframes-and-replace-missing-rows-with-0) and [here](https://stackoverflow.com/questions/25539326/filling-in-columns-with-matching-ids-from-two-dataframes-in-r) – user5249203 Jan 03 '18 at 18:09
  • Unfortunately that isn't the same case. i'd like the code to simply match ids between the two variables and replace the values from df.2 to df.1. The ids only specify values which are missing in df.1. – dayleymart Jan 03 '18 at 18:13
  • Why do you need to keep them as both factor variables? Just convert to `character`. – Gregor Thomas Jan 03 '18 at 18:16
  • `within(merge(df.1, df.2, all = TRUE), { V1 = pmax(as.character(v1), as.character(v2), na.rm = TRUE) ; rm(v1, v2)})`.... – A5C1D2H2I1M1N2O1R2T1 Jan 03 '18 at 18:17
  • Ultimately I'd like all the data to be in factor form at the end as I would like to do further analysis using them in that class... – dayleymart Jan 03 '18 at 18:33
  • tricky question. – MKR Jan 03 '18 at 18:53
  • 1
    It's easy to convert back to factor at the end. – Gregor Thomas Jan 03 '18 at 18:58
  • Yes but would it retain the same levels as it had at the beginning? If it originally had a level but it wasn't used as a value in the observations, then turning it into a character would mean it loses that level when its finally converted to a factor again? – dayleymart Jan 03 '18 at 19:00
  • @Gregor I would have preferred converting at end too. – MKR Jan 03 '18 at 19:00
  • So save the beginning levels! `init_levels = levels(df.1$v1) ... ... result$v1 = factor(result$v1, levels = init_levels)` – Gregor Thomas Jan 03 '18 at 19:05
  • @Gregor Yep I understand that and thank you for the code. I was just worried that since its a large amount of survey data, that any other meta-data information could have been lost from it similar to those levels? – dayleymart Jan 03 '18 at 19:08

4 Answers4

2

As @Gregor mentioned you can to convert df back to factors. The handy function here is coalesce function by @MrFlick. The solution is self-explanatory

library(dplyr)

df.1 %>%
  left_join(df.2, by = "id") %>%
  mutate_if(is.factor, as.character) %>%
  mutate(final = coalesce(v1, v2))  %>% mutate_if(is.character, as.factor)

output

   id      v1     v2   final
1  334  4 dogs   <NA>  4 dogs
2  440    <NA> 4 dogs  4 dogs
3  501  3 dogs   <NA>  3 dogs
4 2304    <NA> 5 dogs  5 dogs
5 2500 No dogs   <NA> No dogs

store the above result in a variable(df), then check the str(df)

'data.frame':   5 obs. of  4 variables:
 $ id   : num  334 440 501 2304 2500
 $ v1   : Factor w/ 3 levels "3 dogs","4 dogs",..: 2 NA 1 NA 3
 $ v2   : Factor w/ 2 levels "4 dogs","5 dogs": NA 1 NA 2 NA
 $ final: Factor w/ 4 levels "3 dogs","4 dogs",..: 2 2 1 3 4

if you want to drop the v1 and v2 columns, just pipe the final result to %>% select(id,final)

Hope it works.

user5249203
  • 4,436
  • 1
  • 19
  • 45
0

You can join df.1 and df.2 to keep both v1 and v2 in the merged data.frame. The run logic to replace missing v1 with value of v2.

library(dplyr)

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"))

df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"))
#merge using left_join to keep all rows from df.1
final <- df.1 %>%
  left_join(df.2, by = "id")
#> final
#    id      v1     v2
#1  334  4 dogs   <NA>
#2  440    <NA> 4 dogs
#3  501  3 dogs   <NA>
#4 2304    <NA> 5 dogs
#5 2500 No dogs   <NA>

#Define a function to replace missing v1
replMissing <- function(x, y){
  ifelse(is.na(x), y, x )
}

#call replMissing function using mapply. Modified to handle factor
final$v1 <- as.factor(mapply(replMissing, as.character(final$v1), as.character(final$v2)))

#results is
#> final
#    id      v1     v2
#1  334  4 dogs   <NA>
#2  440  4 dogs 4 dogs
#3  501  3 dogs   <NA>
#4 2304  5 dogs 5 dogs
#5 2500 No dogs   <NA>

The v2 column can be dropped now

MKR
  • 19,739
  • 4
  • 23
  • 33
  • Is there possible solution to then change them back to a factor and retain all the levels it originally had? – dayleymart Jan 03 '18 at 18:40
  • @dayleymart modified solution to handle `factor` value. Actually while calling `mapply` conversion needs to be done. – MKR Jan 03 '18 at 18:52
0

Using data.table and dplyr:-

library(data.table)
library(dplyr)
df <- left_join(df.1, df.2, by = "id")
setDT(df)
df[is.na(v1), v1 := v2]
df[, v2 := NULL]

You'll get the desired output:-

     id      v1
1:  334  4 dogs
2:  440  4 dogs
3:  501  3 dogs
4: 2304  5 dogs
5: 2500 No dogs

Till this point id would be numeric and v1 would be factor. If you want id also to be converted to factor. You can do that using:-

df[, id := as.factor(id)]
sm925
  • 2,648
  • 1
  • 16
  • 28
  • Would the variable lose its factor type using this method? – dayleymart Jan 03 '18 at 18:40
  • So, `v1` would be factor. `id` would be numeric. If you want to make `id` again factor. You can use `df[, id := as.factor(id)]`. I'll add it in answer, if you want. – sm925 Jan 03 '18 at 18:46
  • Apologies, ID was always numeric anyway and can remain numeric as its simply a Unique identifier. Thanks. – dayleymart Jan 03 '18 at 18:49
  • Cool, anyways I added it. If you need you can use it. – sm925 Jan 03 '18 at 18:50
  • Having just used your code and taken a look at a summary of the newly created v1 variable. It seems 'NA' has been added as a factor for all the missing values in df1 which werent replaced...do you have a solution to turn them back into just missing values? – dayleymart Jan 03 '18 at 21:06
  • I guess this is how it was when you created `df.1`. NA were coming as factors. – sm925 Jan 03 '18 at 21:23
0

Using tidyverse approach you have two solutions:

First solution:

library(dplyr)
df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"),stringsAsFactors=F) 

df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"),stringsAsFactors=F) %>% 
    rename(v1=v2)

df_mix <- bind_rows(df.1,df.2) %>% 
    drop_na(...=v1)

Second solution:

df.1 <- data.frame(id = c(334,440,501,2304,2500), 
                   v1 = c("4 dogs",NA,"3 dogs",NA,"No dogs"),stringsAsFactors=F)

df.2 <- data.frame(id = c(440,2304), 
                   v2 = c("4 dogs","5 dogs"),stringsAsFactors=F) 

df_mix <- left_join(df.1,df.2,by="id") %>% 
    mutate(v1=case_when(
        is.na(v1) ~ v2,
        !is.na(v1) ~ v1
    )) %>% 
    select(1:2)

PS: I prefer to laod strings as character vectors

Scipione Sarlo
  • 1,470
  • 1
  • 17
  • 31