Replace NA with condition

Question

I am currently running an econometric analysis in RStudio and have run into a problem.

My dataset has 1408 observations (704 of type 1 and 704 of type 2) and 49 variables.

Gender    Period   Matching group   Group  Type  Overcharging
1           1            73            1       1    NA
0           2            73            1       1    NA
1           1            77            2       1    NA
1           2            77            2       1    NA
...        ...          ...           ...     ...   ...
0           1            73            1       2    1
0           2            73            1       2    0
1           1            77            2       2    0
1           2            77            2       2    1
...        ...          ...           ...     ...   ...

As you can see, the NA values occur only for agents of type 1. What I would like to do is this: whenever an agent of type 1 belongs to the same matching group, group and period as an agent of type 2, replace the NA with the value of that type 2 agent (row by row).

Expected output:
Gender    Period   Matching group   Group  Type  Overcharging
1           1            73            1       1    1
0           2            73            1       1    0
1           1            77            2       1    0
1           2            77            2       1    1
0           1            73            1       2    1
0           2            73            1       2    0
1           1            77            2       2    0
1           2            77            2       2    1

[Please make your example reproducible.](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — jogo, May 21 '17 at 15:28

jogo · Accepted Answer · 2017-05-21T10:44:18.147

Here is a solution with data.table:

library("data.table")
dt <- fread(header=TRUE,
'Gender    Period   Matching.group   Group  Type  Overcharging
1           1            73            1       1    NA
0           2            73            1       1    NA
1           1            77            2       1    NA
1           2            77            2       1    NA
0           1            73            1       2    1
0           2            73            1       2    0
1           1            77            2       2    0
1           2            77            2       2    1')

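# Look-up table: the Overcharging value of each type 2 row, by Group and Period
d2 <- dt[Type != 1, Overcharging, by = .(Group, Period)]
# Join the look-up onto the type 1 rows (i.Overcharging is the value from d2),
# then stack the unchanged type 2 rows underneath
rbind(dt[Type == 1][d2, on = .(Group, Period), Overcharging := i.Overcharging],
      dt[Type != 1])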

# > rbind(dt[Type==1][d2, on=.(Group, Period), Overcharging:=i.Overcharging],dt[Type!=1])
#    Gender Period Matching.group Group Type Overcharging
# 1:      1      1             73     1    1            1
# 2:      0      2             73     1    1            0
# 3:      1      1             77     2    1            0
# 4:      1      2             77     2    1            1
# 5:      0      1             73     1    2            1
# 6:      0      2             73     1    2            0
# 7:      1      1             77     2    2            0
# 8:      1      2             77     2    2            1

In your special case you may be able to do simply:

dt[Type==1, Overcharging:=dt[Type!=1, Overcharging]]

(this works only if the rows for Type!=1 are in the same Group and Period order as the rows for Type==1)
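
The ordering condition can be checked before the shortcut is applied. The following is only a minimal sketch on the example data: the helper `pair_id` and the flag `same_order` are hypothetical names introduced for illustration, and the choice of `Matching.group`, `Group` and `Period` as identifying columns is taken from the question.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  Gender         = c(1, 0, 1, 1, 0, 0, 1, 1),
  Period         = c(1, 2, 1, 2, 1, 2, 1, 2),
  Matching.group = c(73, 73, 77, 77, 73, 73, 77, 77),
  Group          = c(1, 1, 2, 2, 1, 1, 2, 2),
  Type           = c(1, 1, 1, 1, 2, 2, 2, 2),
  Overcharging   = c(NA, NA, NA, NA, 1, 0, 0, 1)
)

# Hypothetical helper: one string per row describing which pair it belongs to
pair_id <- function(d) paste(d$Matching.group, d$Group, d$Period)

# TRUE only if both types list the same pairs in the same row order
same_order <- identical(pair_id(dt[Type == 1]), pair_id(dt[Type != 1]))

# Apply the positional shortcut only when the assumption holds
if (same_order) {
  dt[Type == 1, Overcharging := dt[Type != 1, Overcharging]]
}
```

If `same_order` is `FALSE`, the join shown earlier in this answer is the safer route, because it matches rows by value rather than by position.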

Thanks for your answer. There is something in the code I don't understand (yes, I really am a beginner!). What does the "." before .(Group, Period) stand for? — Marc, May 21 '17 at 10:51
`.()` is shorthand for `list()` when you are using the package `data.table`. — jogo, May 21 '17 at 11:27
Sorry, but I get an error message: "Error in if (drop) warningc("drop ignored") : argument is not interpretable as logical In addition: Warning message: In if (drop) warningc("drop ignored") : the condition has length > 1 and only the first element will be used". Do you know what it means? — Marc, May 21 '17 at 12:59
My code runs without any error. Perhaps you are using different data. Please edit your question to show your data using `dput()`. — jogo, May 21 '17 at 13:24
I will post another question asking for a different way to solve the problem, because your solution does not run on my computer (that is my fault, and I don't have enough background to fix it...). Thank you :) ! — Marc, May 21 '17 at 14:57
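
To illustrate the `.()` point from the comments above, here is a small sketch on made-up data (the three-column table is an assumption for demonstration only). Inside the square brackets of a `data.table`, both spellings select the same columns.

```r
library(data.table)

# Toy table, invented for this illustration
dt <- data.table(Group = c(1, 2), Period = c(1, 2), Overcharging = c(0, 1))

a <- dt[, .(Group, Period)]     # shorthand form
b <- dt[, list(Group, Period)]  # the same call spelled out

# Compare the two results
all.equal(a, b)
```

The shorthand is only understood inside `data.table` expressions such as `j`, `by` and `on`; elsewhere in R, `list()` has to be written out.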

www · Answer 2 · 2017-05-21T11:31:59.197

We can use functions from dplyr and tidyr (both part of the tidyverse) for such a task. The fill function from tidyr replaces missing values with the nearest non-missing value from the previous or the next row. So the idea is to arrange the data frame first, so that each type 1 row sits next to its matching type 2 row, and then use fill to impute all NA values in the Overcharging column.

library(tidyverse)

dt <- read.csv(text = "Gender,Period,Matching.group,Group,Type,Overcharging
1,1,73,1,1,NA
0,2,73,1,1,NA
1,1,77,2,1,NA
1,2,77,2,1,NA
0,1,73,1,2,1
0,2,73,1,2,0
1,1,77,2,2,0
1,2,77,2,2,1",
               stringsAsFactors = FALSE)

dt2 <- dt %>%
  mutate(ID = 1:n()) %>%                             # Record the original row order
  arrange(Period, `Matching.group`, Group, Type) %>% # Sort rows so each type 1 row sits directly above its type 2 match
  fill(Overcharging, .direction = c("up")) %>%       # Fill each NA with the next non-missing value below it
  arrange(ID) %>%                                    # Restore the original row order
  select(-ID)                                        # Remove the helper ID column
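
The fill approach above depends on the sorted rows being adjacent. As a hedged alternative sketch (not part of the original answer), the same rule can be written as an explicit join in dplyr. It assumes there is exactly one type 2 row per combination of `Matching.group`, `Group` and `Period`; the names `lookup`, `dt_filled` and `Overcharging_type2` are hypothetical.

```r
library(dplyr)

dt <- read.csv(text = "Gender,Period,Matching.group,Group,Type,Overcharging
1,1,73,1,1,NA
0,2,73,1,1,NA
1,1,77,2,1,NA
1,2,77,2,1,NA
0,1,73,1,2,1
0,2,73,1,2,0
1,1,77,2,2,0
1,2,77,2,2,1",
               stringsAsFactors = FALSE)

# Look-up table holding the type 2 value for each pair
lookup <- dt %>%
  filter(Type == 2) %>%
  select(Matching.group, Group, Period, Overcharging_type2 = Overcharging)

dt_filled <- dt %>%
  left_join(lookup, by = c("Matching.group", "Group", "Period")) %>%
  mutate(Overcharging = coalesce(Overcharging, Overcharging_type2)) %>% # Keep existing values, fill only the NAs
  select(-Overcharging_type2)                                           # Drop the helper column
```

Because `left_join` keeps the rows of `dt` in their original order, no helper ID column is needed here.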
