How to impute missing observations in subsequent rows?

Question

I'm having difficulties with some recoding (filling in empty cells in R or SPSS).

I'm working with a long-format data set (in order to run a multilevel model) where each respondent (ID variable) has three rows, so the same ID number appears three times in a row (for three different moments in time).

The problem is that for a second variable (ancestry of respondent), only the first row has a value; the next two rows for each respondent are missing that (same) value (0/1). Can anyone help? I'm only used to recoding within the same row. Below is the data format.

ID      Ancestry    
1003    1
1003    .
1003    .
1004    0
1004    .
1004    .
1005    1
1005    .
1005    .

score 4 · Accepted Answer · answered May 08 '16 at 12:04

We can use na.locf from the zoo package, assuming that . stands for NA.

 library(zoo)
 df1$Ancestry <- na.locf(df1$Ancestry)

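If the column is non-numeric because it contains . as values, convert it to numeric first so that each . is coerced to NA (with a warning), and then apply na.locf. If the column was read as a factor, going through as.character makes sure the values rather than the level codes are converted:

 df1$Ancestry <- na.locf(as.numeric(as.character(df1$Ancestry)))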
 df1$Ancestry
 #[1] 1 1 1 0 0 0 1 1 1

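If the fill needs to be done within each "ID" (na.rm = FALSE keeps a leading NA, so every group returns a vector of the same length):

 library(data.table)
 setDT(df1)[, Ancestry := na.locf(Ancestry, na.rm = FALSE), by = ID]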
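For readers without zoo or data.table, the same per-ID carry-forward can be sketched in base R. This is only an illustration: the helper name locf is made up here, and the data are rebuilt from the question with read.table, using na.strings = "." so the dots arrive as NA.

```r
# Rebuild the question's data; na.strings = "." turns the dots into NA on read
df1 <- read.table(text = "
ID Ancestry
1003 1
1003 .
1003 .
1004 0
1004 .
1004 .
1005 1
1005 .
1005 .
", header = TRUE, na.strings = ".")

# Hypothetical helper: last observation carried forward (leading NAs stay NA)
locf <- function(x) {
  idx <- cumsum(!is.na(x))      # position of the most recent non-missing value
  c(NA, x[!is.na(x)])[idx + 1]  # idx == 0 maps to the leading NA slot
}

# Apply within each ID so values never leak across respondents
df1$Ancestry <- ave(df1$Ancestry, df1$ID, FUN = locf)
df1$Ancestry
```

Reading the file with na.strings = "." in the first place also avoids the separate as.numeric conversion step.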

score 2 · Answer 2 · answered May 08 '16 at 14:16

In SPSS this should do the job, assuming the "Ancestry" variable is numeric:

AGGREGATE /OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES/BREAK=ID /Ancestry=MAX(Ancestry).

If "Ancestry" is a string, you could go this way:

sort cases by ID Ancestry (d).
if ID=lag(ID) and Ancestry="" Ancestry=lag(Ancestry).
execute.

coffeinjunky · Answer 3 · 2016-05-08T15:23:13.263

Another easy way of achieving this in R is the following, which uses the fact that the actual value always occurs in the first position for each ID:

library(dplyr)
df %>% group_by(ID) %>% mutate(Ancestry = Ancestry[1])

Source: local data frame [9 x 2]
Groups: ID [3]

     ID Ancestry
  (int)    (chr)
1  1003        1
2  1003        1
3  1003        1
4  1004        0
5  1004        0
6  1004        0
7  1005        1
8  1005        1
9  1005        1
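Taking Ancestry[1] relies on the known value sitting in the first row of each ID. A minimal sketch of a variant that picks the first non-missing value instead, wherever it sits; the toy data frame is made up for illustration and assumes the dots have already been converted to NA:

```r
library(dplyr)

# Hypothetical data: the known value is not always in the first row
df <- data.frame(ID = rep(c(1003, 1004, 1005), each = 3),
                 Ancestry = c(NA, 1, NA, 0, NA, NA, NA, NA, 1))

filled <- df %>%
  group_by(ID) %>%
  # first non-missing value in the group (NA if the whole group is missing)
  mutate(Ancestry = Ancestry[!is.na(Ancestry)][1]) %>%
  ungroup()

filled
```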

If you prefer a base R solution, here is what I would probably have done, though there are many ways of achieving the same result. First, note that if df is your data frame, then

 df$Ancestry <- as.numeric(df$Ancestry)

will coerce the . into NA. Then we could use

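# keep only the rows where Ancestry is known (one per ID)
df_id <- df[complete.cases(df), ]
# drop the incomplete column, then merge the known values back on by ID
df$Ancestry <- NULL
df <- merge(df, df_id, all.x = TRUE)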

which gives the same output. Here, I take a dataframe that consists only of complete entries, and merge it back onto the original dataframe.
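The same lookup idea can also be written with match(), which leaves the rows in their original order. A sketch, assuming each ID has at least one known value and that Ancestry was read as character; the data frame is rebuilt here from the question's data:

```r
# Data as in the question, with "." read as character
df <- data.frame(ID = rep(c(1003, 1004, 1005), each = 3),
                 Ancestry = c("1", ".", ".", "0", ".", ".", "1", ".", "."),
                 stringsAsFactors = FALSE)

# "." cannot be parsed as a number, so it becomes NA (the warning is expected)
df$Ancestry <- suppressWarnings(as.numeric(df$Ancestry))

# Lookup table of rows with a known value
known <- df[!is.na(df$Ancestry), ]

# For every row, take the value from the first known row with the same ID
df$Ancestry <- known$Ancestry[match(df$ID, known$ID)]
df
```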

score 2 · Answer 4 · answered May 08 '16 at 21:17

Once you convert the .s to NA by your favorite method, this is exactly what tidyr::fill was designed to do:

library(tidyr)

df %>% extract(Ancestry, 'Ancestry', convert = TRUE) %>% fill(Ancestry)
# 
#     ID Ancestry
# 1 1003        1
# 2 1003        1
# 3 1003        1
# 4 1004        0
# 5 1004        0
# 6 1004        0
# 7 1005        1
# 8 1005        1
# 9 1005        1
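Used without grouping, fill() runs down the whole column, so a respondent whose first row is itself missing would inherit the previous respondent's value. A sketch of a grouped variant, assuming dplyr is also available; the example data are hypothetical, and .direction = "downup" fills downward first and then upward within each ID:

```r
library(dplyr)
library(tidyr)

# Hypothetical data: the known value can sit in any row of the ID block
df <- data.frame(ID = rep(c(1003, 1004, 1005), each = 3),
                 Ancestry = c(1, NA, NA, NA, 0, NA, NA, NA, 1))

filled <- df %>%
  group_by(ID) %>%                         # never fill across respondents
  fill(Ancestry, .direction = "downup") %>%
  ungroup()

filled
```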

score 2 · Answer 5 · edited May 15 '16 at 06:37

IF (ID EQ LAG(ID)) Ancestry=LAG(Ancestry).

Or alternatively:

IF (ID EQ LAG(ID) AND MISSING(Ancestry)) Ancestry=LAG(Ancestry).

edited May 15 '16 at 06:37 · eli-k
answered May 09 '16 at 16:54 · David Marso
