0

I am trying to replace the blank (missing) zipcodes in the df table with the zipcodes in another table called zipless, based on names. What would be the best approach? A for loop is probably very slow.

I was trying with something like this, but it does not work.

df$zip_new <- ifelse(df, is.na(zip_new),
                     left_join(df,zipless, by = c("contbr_nm" = "contbr_nm")),
                     zip_new)

I was able to make it work using this approach, but I am sure it is not the best one. I first added a new column from the lookup table and in the next step selectively used it, where necessary.

library(dplyr)
#temporarly renaming the lookup column in the lookup table
zipless <- plyr::rename(zipless, c("zip_new"="zip_new_temp"))
#adding the lookup column to the main table
df <- left_join(df, zipless, by = c("contbr_nm" = "contbr_nm"))
#taking over the value from the lookup column zip_new_temp if the condition is met, else, do nothing.
df$zip_new  <- ifelse((df$zip_new == "") &
                              (df$contbr_nm %in% zipless$contbr_nm), 
                            df$zip_new_temp,
                            df$zip_new)

What would be a proper way to do this?

Thank you very much!

Trgovec
  • 555
  • 3
  • 7
  • 16
  • 1
    [Reproducible data](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example), please. Even a couple of toy data sets. You can use `dput` on some of yours or create some that illustrate the problem. Also include your desired output from these. As it stands, your question is too vague. Further, include the names of packages that you are using: `left_join` is not a base R function. – lmo Aug 13 '17 at 22:13
  • 1
    Use the coalesce() function, ie. df$zip_new <- coalesce(df$zip_new, df$zip_new_temp) – Josh Gilfillan Aug 13 '17 at 22:16
  • 1
    note that `coalesce` is in the `dplyr` package. – Ben Bolker Aug 13 '17 at 22:22

2 Answers2

3

I'd suggest using match to just grab the zips you need. Something like:

miss_zips = is.na(df$zip_new)
df$zip_new[miss_zips] = zipless$zip_new[match(
    df$contbr_nm[miss_zips], 
    zipless$contbr_nm
  )]

Without sample data I'm not wholly sure of your column names, but something like that should work.

Gregor Thomas
  • 136,190
  • 20
  • 167
  • 294
1

I can only recommend the data.table-package for things like these. But your general approach is correct. The data.table-package has a much nicer syntax and is designed to handle large data sets.

In data.table it would probably look like this:

zipcodes <- data.table(left_join(df, zipless, by = "contbr_nm"))
zipcodes[, zip_new := ifelse(is.na(zip_new), zip_new_temp, zip_new)]
guscht
  • 843
  • 4
  • 20
  • Is `left_join` from the `dplyr` package? – www Aug 13 '17 at 22:40
  • yes it is, sorry for not mentioning above. Here you go: https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf – Trgovec Aug 13 '17 at 22:41
  • For a `data.table` answer, you might as well use `data.table` for the join, you can use `zipcodes <- merge(df, zipless, by = "contbr_nm", all.x = T)` or [as suggested here](https://stackoverflow.com/a/34600831/903061) – Gregor Thomas Aug 14 '17 at 07:18