Add an incremental letter to filled NAs from na.locf()

Question

I have a data.frame that looks like this:

df <- structure(list(
  a = c("atg", "tga", "agt", "acc", "cgt", "gca",
    "gtc", "ggg", "ccc"),
  b = c("1", "2", NA, "3", NA, NA, "4", "5",
    "6")
),
row.names = c(NA, -9L),
class = "data.frame")

I have replaced the NAs with the nearest non-NA using na.locf from the zoo package, but I need to add an incremental letter to the replaced NA values, so that the end product looks like this:

> df
    a    b
1 atg    1
2 tga    2
3 agt    2a
4 acc    3
5 cgt    3a
6 gca    3b
7 gtc    4
8 ggg    5
9 ccc    6

I wrote a small function with an if statement that fills the NAs appropriately, but it adds letters to all values and recycles the numbers to match the length of letters. I can see that this results from the any call within the function. I am now thinking I probably need a for loop to step through each cell; however, a for loop with a variant of the if statement doesn't do anything. Any suggestions are welcome.

testif <- function(x) {
  if (any(is.na(x)))  {
    paste(na.locf(x), letters, sep = "")
  }
}

for (x in df$b)     {
    if (any(is.na(x)))  {
        paste(test$b, na.locf(x), letters, sep = "")
    }
}

G. Grothendieck · Accepted Answer · 2019-02-21T16:42:07.367

3

Define seq_let, which returns a sequence of letters as long as its argument when the argument is all NA, and "" otherwise. Then use ave with rleid to group b into runs of NA and non-NA values, apply seq_let to each run, and paste the result onto na.locf0(b).

library(data.table)
library(zoo)

seq_let <- function(x) if (all(is.na(x))) letters[seq_along(x)] else ""
transform(df, b = paste0(na.locf0(b), ave(b, rleid(is.na(b)), FUN = seq_let)))

giving:

    a  b
1 atg  1
2 tga  2
3 agt 2a
4 acc  3
5 cgt 3a
6 gca 3b
7 gtc  4
8 ggg  5
9 ccc  6
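The run-grouping idea can be traced step by step. The following is only a base R sketch, not the answer's code: `run_id`, `pos`, `suffix` and `filled` are illustrative names, the `cumsum(diff(...))` expression stands in for `rleid`, and the indexing trick stands in for `na.locf0`. It assumes the first element of `b` is not NA.

```r
# Base R sketch of the run-grouping idea; no packages needed.
b <- c("1", "2", NA, "3", NA, NA, "4", "5", "6")

na <- is.na(b)

# Run id: increments whenever is.na(b) flips (stand-in for data.table::rleid)
run_id <- cumsum(c(TRUE, diff(na) != 0))
# 1 1 2 3 4 4 5 5 5

# Position of each element within its run
pos <- ave(seq_along(b), run_id, FUN = seq_along)
# 1 2 1 1 1 2 1 2 3

# Letter suffix only at the NA positions, "" elsewhere
suffix <- ifelse(na, letters[pos], "")

# Last observation carried forward (stand-in for zoo::na.locf0);
# assumes b[1] is not NA
filled <- b[!na][cumsum(!na)]

result <- paste0(filled, suffix)
print(result)
# "1" "2" "2a" "3" "3a" "3b" "4" "5" "6"
```

Because the letters are indexed by position within a run, a run of more than 26 consecutive NAs would yield NA suffixes in this sketch.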

edited Feb 21 '19 at 16:42

answered Feb 21 '19 at 16:35


Thanks. I was wondering if there was a way to do it by grouping too. Yours was first, so you get the answered. – bob1 Feb 21 '19 at 16:47
The answers are only equivalent if b has no non-NA duplicates. This one still works in that case. Can't say whether that is important or not. – G. Grothendieck Feb 21 '19 at 16:54
Thanks for that. I don't know if there are any non-NA duplicates, the full dataset is something like 45 million entries (3rd-gen sequencing data) which have been mapped to a reference, so each read could potentially have NAs at the same reference position (column b). I'll test this out and see how it goes on a subset of the full, making sure to include a few reads. – bob1 Feb 21 '19 at 17:00
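The point about non-NA duplicates can be checked on a small made-up vector. This is a base R sketch with hypothetical names (`by_value`, `by_run`) and invented input; it mimics grouping by filled value versus grouping by run, and uses an indexing stand-in for `na.locf` that assumes the first element is not NA.

```r
b <- c("1", NA, "2", "1", NA)   # "1" appears twice as a non-NA value
na <- is.na(b)
filled <- b[!na][cumsum(!na)]   # base R LOCF stand-in; assumes b[1] is not NA

# Grouping by value: every repeat of a filled value is counted,
# including genuine (non-NA) duplicates
s <- ave(seq_along(filled), filled, FUN = seq_along) - 1
by_value <- filled
by_value[s != 0] <- paste0(filled[s != 0], letters[s])

# Grouping by run of NA / non-NA: only filled NAs get a letter
run_id <- cumsum(c(TRUE, diff(na) != 0))
pos <- ave(seq_along(b), run_id, FUN = seq_along)
by_run <- paste0(filled, ifelse(na, letters[pos], ""))

print(by_value)
# "1" "1a" "2" "1b" "1c"
print(by_run)
# "1" "1a" "2" "1" "1a"
```

With duplicates present, the value-based grouping also labels the second genuine "1", while the run-based grouping leaves it untouched.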

score 2 · Answer 2 · answered Feb 21 '19 at 16:39


This can be done with zoo and base R:

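x <- zoo::na.locf(df$b)                           # carry the last non-NA value forward
s <- as.numeric(ave(x, x, FUN = seq_along)) - 1   # 0 for the first occurrence of each value, then 1, 2, ...
x[s != 0] <- paste0(x[s != 0], letters[s])        # zero indices are dropped, so only repeats get a letter
df$b <- x
df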
    a  b
1 atg  1
2 tga  2
3 agt 2a
4 acc  3
5 cgt 3a
6 gca 3b
7 gtc  4
8 ggg  5
9 ccc  6

answered Feb 21 '19 at 16:39

BENY


Thanks. That's nice and compact, as well as nicely readable to a novice `r` user. – bob1 Feb 21 '19 at 16:47

Henrik · Answer 3 · 2019-02-21T21:46:16.317

Borrowing code from "Create counter within consecutive runs of certain values":

library(zoo)  # provides na.locf

i <- is.na(df$b)  # TRUE where b is missing
g <- cumsum(i)    # running count of NAs seen so far
df$b <- paste0(na.locf(df$b), c("", letters)[g - cummax((!i) * g) + 1])

#     a  b
# 1 atg  1
# 2 tga  2
# 3 agt 2a
# 4 acc  3
# 5 cgt 3a
# 6 gca 3b
# 7 gtc  4
# 8 ggg  5
# 9 ccc  6
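To see why the `cumsum`/`cummax` expression counts positions inside each NA run, it may help to print the intermediate vectors. This is a base R sketch on the question's column; `offset`, `counter` and `suffix` are names introduced here for illustration only.

```r
b <- c("1", "2", NA, "3", NA, NA, "4", "5", "6")

i <- is.na(b)                  # TRUE at the NA positions
g <- cumsum(i)                 # running count of NAs:             0 0 1 1 2 3 3 3 3
offset <- cummax((!i) * g)     # g at the most recent non-NA:      0 0 0 1 1 1 3 3 3
counter <- g - offset          # position inside the current run:  0 0 1 0 1 2 0 0 0

# Index 1 of c("", letters) is "", so non-NA rows get no suffix
suffix <- c("", letters)[counter + 1]

print(counter)
print(suffix)
# "" "" "a" "" "a" "b" "" "" ""
```

Subtracting the NA count recorded at the last non-NA position resets the counter to zero after every observed value.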

A more compact version using data.table, taking the main idea from "Count consecutive TRUE values within each block separately":

library(data.table)
library(zoo)  # provides na.locf

setDT(df)[ ,  b := paste0(na.locf(b), c("", letters)[rowid(rleid(b)) * is.na(b) + 1])]

#      a  b
# 1: atg  1
# 2: tga  2
# 3: agt 2a
# 4: acc  3
# 5: cgt 3a
# 6: gca 3b
# 7: gtc  4
# 8: ggg  5
# 9: ccc  6
