I have a data.frame that looks like this:

df <- structure(list(
  a = c("atg", "tga", "agt", "acc", "cgt", "gca",
    "gtc", "ggg", "ccc"),
  b = c("1", "2", NA, "3", NA, NA, "4", "5",
    "6")
),
row.names = c(NA, -9L),
class = "data.frame")

I have replaced the NAs with the nearest non-NA using na.locf from the zoo package, but I need to add an incremental letter to the replaced NA values, so that the end product looks like this:

> df
    a    b
1 atg    1
2 tga    2
3 agt    2a
4 acc    3
5 cgt    3a
6 gca    3b
7 gtc    4
8 ggg    5
9 ccc    6

I wrote a small function with an if statement that fills the NAs appropriately, but it adds letters to all values and recycles the numbers to match the length of letters. I can see that this result comes from the any call within the function. I am now thinking I probably need a for loop to step through each cell, but a for loop with a variant of the if statement doesn't do anything. Any suggestions are welcome.

testif <- function(x) {
  if (any(is.na(x)))  {
    paste(na.locf(x), letters, sep = "")
  }
}

for (x in df$b)     {
    if (any(is.na(x)))  {
        paste(test$b, na.locf(x), letters, sep = "")
    }
}
Konrad Rudolph
bob1

3 Answers

3

Define seq_let, which returns a sequence of letters as long as its argument if the argument is all NA, and "" otherwise. Then group the runs of NAs and non-NAs using ave and rleid, apply seq_let to each group, and prepend na.locf0(b) to the result.

library(data.table)
library(zoo)

seq_let <- function(x) if (all(is.na(x))) letters[seq_along(x)] else ""
transform(df, b = paste0(na.locf0(b), ave(b, rleid(is.na(b)), FUN = seq_let)))

giving:

    a  b
1 atg  1
2 tga  2
3 agt 2a
4 acc  3
5 cgt 3a
6 gca 3b
7 gtc  4
8 ggg  5
9 ccc  6
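To see what the grouping step produces, here is a minimal sketch of the intermediate values in base R only. It assumes that rle() plus rep() is an acceptable stand-in for data.table's rleid(); the names r, run_id and suffix are just for illustration.

```r
b <- c("1", "2", NA, "3", NA, NA, "4", "5", "6")

# seq_let as defined above: letters for an all-NA run, "" otherwise
seq_let <- function(x) if (all(is.na(x))) letters[seq_along(x)] else ""

# Base-R stand-in for rleid(is.na(b)): one id per run of NA / non-NA
r <- rle(is.na(b))
run_id <- rep(seq_along(r$lengths), r$lengths)
run_id
# 1 1 2 3 4 4 5 5 5

# Apply seq_let within each run; "" is recycled over non-NA runs
suffix <- ave(b, run_id, FUN = seq_let)
suffix
# "" "" "a" "" "a" "b" "" "" ""
```

Pasting the filled-in values in front of suffix then gives the b column shown above.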
G. Grothendieck
  • Thanks. I was wondering if there was a way to do it by grouping too. Yours was first, so you get the accepted answer. – bob1 Feb 21 '19 at 16:47
  • The answers are only equivalent if b has no non-NA duplicates. This one still works in that case. Can't say whether that is important or not. – G. Grothendieck Feb 21 '19 at 16:54
  • Thanks for that. I don't know if there are any non-NA duplicates, the full dataset is something like 45 million entries (3rd-gen sequencing data) which have been mapped to a reference, so each read could potentially have NAs at the same reference position (column b). I'll test this out and see how it goes on a subset of the full, making sure to include a few reads. – bob1 Feb 21 '19 at 17:00
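To illustrate the point about non-NA duplicates, here is a small sketch on a made-up vector in which "1" appears twice as a real value. It uses base R only: the fill step is a hand-rolled stand-in for na.locf that assumes the first element is not NA, and all variable names are hypothetical. It compares numbering within each run of NAs against numbering within each group of equal filled values.

```r
# Toy input: "1" occurs twice as a genuine (non-NA) value
b <- c("1", NA, "2", "1", NA)
na_flag <- is.na(b)

# Last observation carried forward (assumes b[1] is not NA)
last_seen <- cummax((!na_flag) * seq_along(b))
filled <- b[last_seen]

# Numbering by run: count positions inside each NA / non-NA run
run_id <- cumsum(c(TRUE, diff(na_flag) != 0))
pos_in_run <- ave(seq_along(b), run_id, FUN = seq_along)
by_run <- paste0(filled, c("", letters)[pos_in_run * na_flag + 1])
by_run
# "1" "1a" "2" "1" "1a"

# Numbering by value: count positions inside each group of equal filled values
s <- ave(seq_along(filled), filled, FUN = seq_along) - 1
by_value <- paste0(filled, c("", letters)[s + 1])
by_value
# "1" "1a" "2" "1b" "1c"
```

With by-value numbering the second genuine "1" picks up a letter and the NA after it continues the count from the first group, so the two approaches only agree when no non-NA value repeats.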
2

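This can be done with zoo and base R:

# fill each NA with the last non-NA value
x <- zoo::na.locf(df$b)
# position of each element within its group of equal values, minus 1
s <- as.numeric(ave(x, x, FUN = seq_along)) - 1
# append a letter wherever the value was filled in (zeros in s are dropped by letters[s])
x[s != 0] <- paste0(x[s != 0], letters[s])
df$b <- x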
df
    a  b
1 atg  1
2 tga  2
3 agt 2a
4 acc  3
5 cgt 3a
6 gca 3b
7 gtc  4
8 ggg  5
9 ccc  6
BENY
0

Borrowing code from Create counter within consecutive runs of certain values:

library(zoo)  # for na.locf

i <- is.na(df$b)
g <- cumsum(i)
df$b <- paste0(na.locf(df$b), c("", letters)[g - cummax((!i) * g) + 1])

#     a  b
# 1 atg  1
# 2 tga  2
# 3 agt 2a
# 4 acc  3
# 5 cgt 3a
# 6 gca 3b
# 7 gtc  4
# 8 ggg  5
# 9 ccc  6
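For anyone wondering how the counter works, this sketch prints the intermediate vectors for the example column. It is base R only and works on a standalone copy of b; the names offset, counter and suffix are added here for illustration.

```r
b <- c("1", "2", NA, "3", NA, NA, "4", "5", "6")

i <- is.na(b)               # TRUE where a value is missing
g <- cumsum(i)              # running total of NAs seen so far
g
# 0 0 1 1 2 3 3 3 3

# At each non-NA, remember how many NAs came before it; carry that forward
offset <- cummax((!i) * g)
offset
# 0 0 0 1 1 1 3 3 3

# NAs seen since the last non-NA value: 0 for real values, 1, 2, ... inside a run
counter <- g - offset
counter
# 0 0 1 0 1 2 0 0 0

# Shift by one so that 0 maps to "" and 1, 2, ... map to "a", "b", ...
suffix <- c("", letters)[counter + 1]
suffix
# "" "" "a" "" "a" "b" "" "" ""
```

One caveat: c("", letters) has 27 entries, so a run of more than 26 consecutive NAs would index past the end and paste0 would append the string "NA".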

More compact using data.table, picking the main idea from: Count consecutive TRUE values within each block separately

library(data.table)
library(zoo)  # for na.locf

setDT(df)[ ,  b := paste0(na.locf(b), c("", letters)[rowid(rleid(b)) * is.na(b) + 1])]

#      a  b
# 1: atg  1
# 2: tga  2
# 3: agt 2a
# 4: acc  3
# 5: cgt 3a
# 6: gca 3b
# 7: gtc  4
# 8: ggg  5
# 9: ccc  6
Henrik