1

I have a dataset that contains features Date, Age and Customer_ID. Some of the rows in Age have missing values (NAs) in them and I would like to impute them.

Here is some example data:

Date <- c("201101", "201102", "201101", "201102", "201103")
Age <- c("12-17", "12-17", "30-35", NA, NA)
Customer_ID <- c("1234", "1234", "5678", "5678", "5678")
df <- data.frame(Date, Age, Customer_ID)

Date      Age      Customer_ID
201101    12-17    1234
201102    12-17    1234
201101    30-35    5678
201102    NA       5678
201103    NA       5678

I would like to replace the NAs in Age with 30-35.

So for all NAs, it has to check whether there is another row with the same Customer_ID and replace the NA with the value for Age stated in the other row.

Any ideas on how to do this? Thanks.

kiae
  • 169
  • 1
  • 13

2 Answers2

1

You can use the fill function from tidyr. It is a tidyr function for last observation carried forward, i.e, fill NA values with previous non-NA value. In order for this to work, you can use arrange to sort the column2 which sorts the NA values behind non-NAs and then you can group by customer ID and fill the Age column:

library(dplyr)
library(tidyr)
df %>% arrange(Age) %>% group_by(Customer_ID) %>% fill(Age)

# Source: local data frame [5 x 3]
# Groups: Customer_ID [2]

#      Date    Age Customer_ID
#    <fctr>  <fctr>  <fctr>
# 1  201101   12-17    1234
# 2  201102   12-17    1234
# 3  201101   30-35    5678
# 4  201102   30-35    5678
# 5  201103   30-35    5678
Psidom
  • 209,562
  • 33
  • 339
  • 356
1

With base R:

lookup <- unique(df[!is.na(df$Age),][c('Customer_ID', 'Age')])
df[is.na(df$Age),]$Age <- lookup[match(df[is.na(df$Age),]$Customer_ID,
                                                lookup$Customer_ID),]$Age
Sandipan Dey
  • 21,482
  • 2
  • 51
  • 63