8

I am trying to add a row of NAs after each group of data in R.

A similar question has been asked earlier. Insert a blank row after each group of data.

The accepted answer works fine in this case too as follows.

group <- c("a","b","b","c","c","c","d","d","d","d")
xvalue <- c(16:25)
yvalue <- c(1:10)
df <- data.frame(cbind(group,xvalue,yvalue))
df_new <- as.data.frame(lapply(df, as.character), stringsAsFactors = FALSE)
head(do.call(rbind, by(df_new, df$group, rbind, NA)), -1 )
     group xvalue yvalue
a.1      a     16      1
a.2   <NA>   <NA>   <NA>
b.2      b     17      2
b.3      b     18      3
b.31  <NA>   <NA>   <NA>
c.4      c     19      4
c.5      c     20      5
c.6      c     21      6
c.41  <NA>   <NA>   <NA>
d.7      d     22      7
d.8      d     23      8
d.9      d     24      9
d.10     d     25     10

How can I speed this up using data.table for a large data.frame?

Community
  • 1
  • 1
Crops
  • 5,024
  • 5
  • 38
  • 65

2 Answers2

10

You could try

df$group <- as.character(df$group)
setDT(df)[, .SD[1:(.N+1)], by=group][is.na(xvalue), group:=NA][!.N]
#     group xvalue yvalue
#1:     a     16      1
#2:    NA     NA     NA
#3:     b     17      2
#4:     b     18      3
#5:    NA     NA     NA
#6:     c     19      4
#7:     c     20      5
#8:     c     21      6
#9:    NA     NA     NA
#10:    d     22      7
#11:    d     23      8
#12:    d     24      9
#13:    d     25     10

Or as suggested by @David Arenburg

 setDT(df)[, indx := group][, .SD[1:(.N+1)], indx][,indx := NULL][!.N]

Or

 setDT(df)[df[,.I[1:(.N+1)], group]$V1][!.N]

Or it could be further simplified based on @eddi's comments

 setDT(df)[df[, c(.I, NA), group]$V1][!.N]
akrun
  • 874,273
  • 37
  • 540
  • 662
  • 2
    That is very very neat solution, though I think you could avoid messing around with `group` and just create some index and leave it there (or get rid of it afterwards), maybe something like `setDT(df)[, indx := .GRP, group][, .SD[1:(.N+1)], indx]` – David Arenburg Jan 01 '15 at 11:54
  • 2
    Or just `setDT(df)[, indx := group][, .SD[1:(.N+1)], indx][, indx := NULL][]` – David Arenburg Jan 01 '15 at 11:56
  • 1
    @DavidArenburg I think if I use `.I`, it could be made more compact. – akrun Jan 01 '15 at 12:08
  • 1
    The last solution is very neat – David Arenburg Jan 01 '15 at 12:10
  • 1
    I think `...df[, c(.I, NA), group]...` is easier to read/understand – eddi Jan 02 '15 at 16:15
  • @User231 assuming that you are asking about the first solution. We convert the column to character class `as.character`. then convert to data.table (`setDT(df)`), grouped by 'group', get the sequence from 1 to .N+1 i.e. 1 element more than the number or rows per group, use `.SD` to subset the rows based on the index. Now, as the .N+1 rows is not present, it will give an NA row for each group at the end., then the group value is changed to NA (based on the NA in another column) so we a NA row after each group – akrun May 05 '17 at 08:38
  • what if instead of `NA` you want to replace it with a string? – road_to_quantdom Feb 11 '19 at 21:21
5

One way I could think of is to construct a vector first as follows:

foo <- function(x) {
    o = order(rep.int(seq_along(x), 2L))
    c(x, rep.int(NA, length(x)))[o]
}
join_values = head(foo(unique(df_new$group)), -1L)
# [1] "a" NA  "b" NA  "c" NA  "d"

And then setkey() and join.

setkey(setDT(df_new), group)
df_new[.(join_values), allow.cartesian=TRUE]
#     group xvalue yvalue
#  1:     a     16      1
#  2:    NA     NA     NA
#  3:     b     17      2
#  4:     b     18      3
#  5:    NA     NA     NA
#  6:     c     19      4
#  7:     c     20      5
#  8:     c     21      6
#  9:    NA     NA     NA
# 10:     d     22      7
# 11:     d     23      8
# 12:     d     24      9
# 13:     d     25     10
Arun
  • 116,683
  • 26
  • 284
  • 387
  • Do you think this is going to improve performance somehow? Cause akrun solution seems very idiomatic to me – David Arenburg Jan 01 '15 at 11:59
  • 4
    @DavidArenburg, I don't follow why either of them should be idiomatic *here*. It's just another way. I used joins because it gives the answer directly, rather than having to replace with NAs later. – Arun Jan 01 '15 at 12:03