R dataframe, changing the value in a variable with a value from the same variable

Question

I have a Data Frame which looks like this

ID  Name  Surname  Country    Unique_number
1   John   Snow      UK          12345
1   John   Anderson  USA         53214
1   John   David     UK          NA
2   Kim    Snow      UK          62321     
2   Kim   Anderson   USA         77832
2   Kim    David     UK          NA

I want a data from which will look like this (Notice the change in unique_number )

 ID  Name  Surname  Country    Unique_number
1   John   Snow      UK          12345
1   John   Anderson  USA         53214
1   John   David     UK          12345
2   Kim    Snow      UK          62321     
2   Kim   Anderson   USA         77832
2   Kim    David     UK          62321

Can someone help with to do this in R studios ?

Thank you

do you want to replace the missing value with the last seen? — storaged, Sep 12 '18 at 19:42

konvas · Answer 1 · 2018-09-12T20:17:16.910

What is the value of df$Unique_number[2], is it an empty string? You can first convert it to NA

 df$Unique_number[df$Unique_number == ''] <- NA

and then use na.locf from the zoo package

df$Unique_number <- zoo::na.locf(df$Unique_number)

This will carry over the last non-NA observation to replace NAs.

Edit

To preseve original NA values, split your dataframe in two and operate only on the part that contains the values you want to replace (I am assuming empty strings)

df0 = df[is.na(df$Unique_number), ]
df1 = df[-is.na(df$Unique_number), ]

(alternatively use split(df, is.na(df$Unique_number)) and then call the code above on df1 and finally rbind them.

Edit 2

Here is another approach, I'm pretty sure it will be slower than the one above that uses zoo but lets you specify your own string

MISSING_STRING = '' # String you want replaced with last non-NA value
x0 <- c("1", "2", "", "3", "4", "", "", "5", "6", NA, "", "7", "8", 
"", "9", "10", "") # Example vector
x <- x0 # Store initial example vector for comparison at the end
missing.ids <- which(is.na(x) | x == MISSING_STRING)
replacement.ids <- missing.ids - 1
replacement.ids[1 + which(diff(replacement.ids) == 1)] <- replacement.ids[diff(replacement.ids) == 1]
na.ids <- is.na(x)
x[missing.ids] <- x[replacement.ids]
x[na.ids] <- NA
# Compare initial vs final value
cbind(x0, x)
    x0   x   
[1,] "1"  "1" 
[2,] "2"  "2" 
[3,] ""   "2" 
[4,] "3"  "3" 
[5,] "4"  "4" 
[6,] ""   "4" 
[7,] ""   "4" 
[8,] "5"  "5" 
[9,] "6"  "6" 
[10,] NA   NA  
[11,] ""   "6" 
[12,] "7"  "7" 
[13,] "8"  "8" 
[14,] ""   "8" 
[15,] "9"  "9" 
[16,] "10" "10"
[17,] ""   "10"

Thank you, But there is a problem . I cannot replace all the NA's with the previous values because i have other NA's which i want to keep as NA's. — Kp_Data, Sep 12 '18 at 19:47
My Ideal code would be if it could only replace SURNAME Anderson with the previous non NA value. — Kp_Data, Sep 12 '18 at 19:48
Have a look at my edits, I'd personally use `zoo` or write an `Rcpp` function, but it's possible you can have a vectorized implementation that is faster than my second edit (I'm sure it can be optimized but don't have the time to think about it right now) — konvas, Sep 12 '18 at 20:18

acylam · Answer 2 · 2018-09-12T20:36:01.490

0

With fill from tidyr:

library(dplyr)
library(tidyr)

df %>%
  group_by(Name, Country) %>%
  fill(Unique_number)

Output:

# A tibble: 6 x 5
# Groups:   Name, Country [4]
     ID Name  Surname  Country Unique_number
  <int> <fct> <fct>    <fct>           <int>
1     1 John  Snow     UK              12345
2     1 John  David    UK              12345
3     1 John  Anderson USA             53214
4     2 Kim   Snow     UK              62321
5     2 Kim   David    UK              62321
6     2 Kim   Anderson USA             77832

Data:

df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 2L), Name = structure(c(1L, 
1L, 1L, 2L, 2L, 2L), .Label = c("John", "Kim"), class = "factor"), 
    Surname = structure(c(3L, 1L, 2L, 3L, 1L, 2L), .Label = c("Anderson", 
    "David", "Snow"), class = "factor"), Country = structure(c(1L, 
    2L, 1L, 1L, 2L, 1L), .Label = c("UK", "USA"), class = "factor"), 
    Unique_number = c(12345L, 53214L, NA, 62321L, 77832L, NA)), .Names = c("ID", 
"Name", "Surname", "Country", "Unique_number"), class = "data.frame", row.names = c(NA, 
-6L))

edited Sep 12 '18 at 20:36

answered Sep 12 '18 at 19:43

acylam

18,231
5
36
45

Thank you, But there is a problem . I cannot replace all the NA's with the previous values because i have other NA's which i want to keep as NA's. – Kp_Data Sep 12 '18 at 19:57
My Ideal code would be if it could only replace SURNAME Anderson with the previous non NA value. – Kp_Data Sep 12 '18 at 19:57
@Kp_Data I don't quite understand your desired output. You should include that example in your question. – acylam Sep 12 '18 at 19:58
I Have Edited the question. – Kp_Data Sep 12 '18 at 20:10
@Kp_Data I've updated my answer. Looks like you just need to add a `group_by`. The rows are automatically sorted by group – acylam Sep 12 '18 at 20:36

R dataframe, changing the value in a variable with a value from the same variable

2 Answers2