1

I have a very simple question. All I can find however are very complicated answers that do not do exactly what I need.

What came closest, I found here:

Answer by flodel and eddi (data.table)

I would however like to additionally specify how to treat the NA's in the specified column based on the value in a different column.

I have a data.table which has columns with NA's, where fac is a factor variable.

df <- fread(
  "A   B   C   fac   H   I   J   iso   year   matchcode
     0   1   1   NA   0   1   0   NLD   2009   NLD2009
     1   0   0   NA   1   0   1   NLD   2014   NLD2014
     0   0   0   B   1   0   0   AUS   2011   AUS2011
     1   0   1   B   0   1   0   AUS   2007   AUS2007
     0   1   0   NA  0   1   1   USA   2007   USA2007
     0   0   1   NA  0   0   1   USA   2011   USA2010
     0   1   0   NA  0   0   0   USA   2013   USA2013
     1   0   1   A   0   1   0   BLG   2007   BLG2007
     0   1   0   A   1   0   1   BEL   2009   BEL2009
     1   0   1   A   0   1   0   BEL   2012   BEL2012",
  header = TRUE
)

What I would like to do is to assign the values D and E to the NA's in column fac based on the values in iso3c. So when iso3c == NLD, the NA's in fac should be replaced by D and when iso3c == USA the NA's in fac should be replaced by E, leading to the following result.

df <- fread(
  "A   B   C   fac   H   I   J   iso   year   matchcode
     0   1   1   D   0   1   0   NLD   2009   NLD2009
     1   0   0   D   1   0   1   NLD   2014   NLD2014
     0   0   0   B   1   0   0   AUS   2011   AUS2011
     1   0   1   B   0   1   0   AUS   2007   AUS2007
     0   1   0   E  0   1   1   USA   2007   USA2007
     0   0   1   E  0   0   1   USA   2011   USA2010
     0   1   0   E  0   0   0   USA   2013   USA2013
     1   0   1   A   0   1   0   BLG   2007   BLG2007
     0   1   0   A   1   0   1   BEL   2009   BEL2009
     1   0   1   A   0   1   0   BEL   2012   BEL2012",
  header = TRUE
)

EDIT: The fact that fac is a factor variable gave some issues. What worked is the following:

df$fac<- as.character(df$fac)
df[, fac:= ifelse(is.na(fac) & iso3c == "NLD", "D", 
                   ifelse(is.na(fac) & iso3c == "USA", "E", wbgroup))][]
df[, fac:= factor(fac, levels = c(levels(fac), c('A', 'B', 'C', 'D', 'E', 'F', 'G')))]
Tom
  • 2,173
  • 1
  • 17
  • 44

2 Answers2

4

We need to specify the logical condition in i and do the assignment. As there are only two cases to change the value, it can be done in two steps

df[is.na(fac) & iso == 'NLD', fac := 'D'
  ][is.na(fac) & iso == 'USA', fac := 'E'][]
#    A B C fac H I J iso year matchcode
# 1: 0 1 1   D 0 1 0 NLD 2009   NLD2009
# 2: 1 0 0   D 1 0 1 NLD 2014   NLD2014
# 3: 0 0 0   B 1 0 0 AUS 2011   AUS2011
# 4: 1 0 1   B 0 1 0 AUS 2007   AUS2007
# 5: 0 1 0   E 0 1 1 USA 2007   USA2007
# 6: 0 0 1   E 0 0 1 USA 2011   USA2010
# 7: 0 1 0   E 0 0 0 USA 2013   USA2013
# 8: 1 0 1   A 0 1 0 BLG 2007   BLG2007
# 9: 0 1 0   A 1 0 1 BEL 2009   BEL2009
#10: 1 0 1   A 0 1 0 BEL 2012   BEL2012

If there are many values to be replaced, do a join with a key/value dataset and do the assignment

df[data.table(fac = NA_character_, iso = c('NLD', 'USA'), 
        val = c('D', 'E')), fac := val, on = .(fac, iso)]

NOTE: The columns fac, iso are character class. If fac is a factor class, and 'D', 'E' levels didn't exist in the column, then create the new levels before doing the assignment i.e.

df[, fac := factor(fac, levels = c(levels(fac), c('D', 'E')))]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • Hey akrun, once again thank you for your answer. You have been helping me out a lot lately! The `fac` variable is indeed a factor. What if the categories `D` and `E` already exists? – Tom Jan 06 '19 at 02:23
  • @Tom If it already exists, then there is no issue because the levels already is there and you can directly apply the code. Otherwise, create new levels as in the last chunk of code. Should work – akrun Jan 06 '19 at 02:24
  • Hmm, I ran the code without your edit. And it ran without issue. It did however not change anything in the data.. – Tom Jan 06 '19 at 02:26
  • Are you saying that the NA remain the same after the assignment. In that case, check whether the `NA` elements are true `NA`s or some character `"NA"` level – akrun Jan 06 '19 at 02:27
  • @Tom Can you check `levels(df$fac)` – akrun Jan 06 '19 at 02:28
  • @Tom Also, you may need to check if there are leading/lagging spaces for 'NLD', 'USA' as in `'NLD '` – akrun Jan 06 '19 at 02:35
  • @Tom Suppose, I have a column `levels(factor(c('A', 'B', NA, 'A')))# [1] "A" "B"`. If the `level`s include `NA` as well, then there is an issue – akrun Jan 06 '19 at 02:48
  • The NA's were only showing when doing 'table(df$fac, useNA = "always")' and then they were showing up as ``, so not as levels, right?. I tried to first convert everything back to characters, apply www's answer and then redefine the levels. It's a bit cumbersome, but it worked (see edit). – Tom Jan 06 '19 at 02:53
  • @Tom Okay. Anyway, I tried to replicate the issue, but it is working fine for me. `df$fac <- factor(df$fac)` assuming it as `factor` column. Then created the `levels` `df[, fac := factor(fac, levels = c(levels(fac), c('D', 'E')))]` and do the assignment `df[is.na(fac) & iso == 'NLD', fac := 'D' ][is.na(fac) & iso == 'USA', fac := 'E'][]` – akrun Jan 06 '19 at 02:56
1

Another option using data.table with two ifelse statements.

library(data.table)

df[, fac := ifelse(is.na(fac) & iso == "NLD", "D", 
                   ifelse(is.na(fac) & iso == "USA", "E", fac))][]
#     A B C fac H I J iso year matchcode
#  1: 0 1 1   D 0 1 0 NLD 2009   NLD2009
#  2: 1 0 0   D 1 0 1 NLD 2014   NLD2014
#  3: 0 0 0   B 1 0 0 AUS 2011   AUS2011
#  4: 1 0 1   B 0 1 0 AUS 2007   AUS2007
#  5: 0 1 0   E 0 1 1 USA 2007   USA2007
#  6: 0 0 1   E 0 0 1 USA 2011   USA2010
#  7: 0 1 0   E 0 0 0 USA 2013   USA2013
#  8: 1 0 1   A 0 1 0 BLG 2007   BLG2007
#  9: 0 1 0   A 1 0 1 BEL 2009   BEL2009
# 10: 1 0 1   A 0 1 0 BEL 2012   BEL2012
www
  • 38,575
  • 12
  • 48
  • 84
  • Thank you so much for your answer. It works for the specified NA's, but the other factors are now all of a sudden converted to numbers: ` table(df$iso , useNA = "always":` `1 2 3 4 6 7 D E 10856 22264 17196 4539 8479 7598 550 307 0 ` Do I need to do something different because it concerns factor variables? – Tom Jan 06 '19 at 02:18
  • I did not know you have factor in your data table. At least not from your example. If you have factor, you can convert it to character first then run the code. – www Jan 06 '19 at 02:24