3

I'm having difficulties with some recoding (filling in empty cells in R or SPSS)

I'm working with a long format data-set (in order to run a multilevel model) where each respondent (ID-variable) has three rows, so three times the same ID number below each other (for three different momemts in time).

The problem is that for a second variable (ancestry of respondent) only the first row has a value but the two second rows for each respondent misses that (same) value (0/1). Can any one help? I'm only used to recoding within the same row... below a the data format.

ID      Ancestry    
1003    1
1003    .
1003    .
1004    0
1004    .
1004    .
1005    1
1005    .
1005    .
user20650
  • 24,654
  • 5
  • 56
  • 91
Benjamin Telkamp
  • 1,451
  • 2
  • 17
  • 31

5 Answers5

4

We can use na.locf assuming that . implies NA values.

 library(zoo)
 df1$Ancestry <- na.locf(df1$Ancestry)

If the column is non-numeric i.e. have . as values, then we need to convert it to numeric so that the . coerce to NA and then we apply na.locf on it

 df1$Ancestry <- na.locf(as.numeric(df1$Ancestry))
 df1$Ancestry
 #[1] 1 1 1 0 0 0 1 1 1

If it needs to be grouped by "ID"

 library(data.table)
 setDT(df1)[, Ancestry := na.locf(Ancestry), by = ID]
akrun
  • 874,273
  • 37
  • 540
  • 662
2

In SPSS this should do the job, assuming the "Ancestry" variable is numeric:

AGGREGATE /OUTFILE=* MODE=ADDVARIABLES OVERWRITEVARS=YES/BREAK=ID /Ancestry=MAX(Ancestry).

If "Ancestry" is a string, you could go this way:

sort cases by ID Ancestry (d).
if ID=lag(ID) and Ancestry="" Ancestry=lag(Ancestry).
execute.
eli-k
  • 10,898
  • 11
  • 40
  • 44
2

Another easy way of achieving this is in R the following, using the fact that the actual value always occurs in the first position for each ID:

library(dplyr)
df %>% group_by(ID) %>% mutate(Ancestry = Ancestry[1])

Source: local data frame [9 x 2]
Groups: ID [3]

     ID Ancestry
  (int)    (chr)
1  1003        1
2  1003        1
3  1003        1
4  1004        0
5  1004        0
6  1004        0
7  1005        1
8  1005        1
9  1005        1

If you prefer a base solution, I think what I would probably have done is the following, though there are many ways of achieving the same: First, note that if df is your dataframe, then

 df$Ancestry <- as.numeric(df$Ancestry)

will coerce the . into NA. Then we could use

df_id <- df[complete.cases(df),]
df$Ancestry <- NULL
df <- merge(df, df_id, all.x = T)

which gives the same output. Here, I take a dataframe that consists only of complete entries, and merge it back onto the original dataframe.

coffeinjunky
  • 11,254
  • 39
  • 57
2

Once you convert the .s to NA by your favorite method, this is exactly what tidyr::fill was designed to do:

library(tidyr)

df %>% extract(Ancestry, 'Ancestry', convert = TRUE) %>% fill(Ancestry)
# 
#     ID Ancestry
# 1 1003        1
# 2 1003        1
# 3 1003        1
# 4 1004        0
# 5 1004        0
# 6 1004        0
# 7 1005        1
# 8 1005        1
# 9 1005        1
alistaire
  • 42,459
  • 4
  • 77
  • 117
2
IF (ID EQ LAG(ID)) Ancestry=LAG(Ancestry).

Or alternatively:

IF (ID EQ LAG(ID) AND MISSING(Ancestry)) Ancestry=LAG(Ancestry).
eli-k
  • 10,898
  • 11
  • 40
  • 44