2

Let us say I have a data.table

col1   col2    col3
 a     123      1
 a     433      2
 a     322      3       
 b     43       1  
 b     4333     2
 c     43       1

In the above table each category of col1 should have 3 rows. But only category a has 3 rows. So I want to insert blank rows with NAs for missing cases of other categories of col1. Output will look like

col1   col2    col3
 a     123      1
 a     433      2
 a     322      3       
 b     43       1  
 b     4333     2
 b     NA       3
 c     43       1
 c     NA       2
 c     NA       3

How do I achieve it with just one call of a function?

user3664020
  • 2,980
  • 6
  • 24
  • 45
  • 1
    `DT[CJ(col1=col1, col3=col3, unique=TRUE), on=c("col1","col3")]`. Having to repeat the column names like that is annoying. In the next version, it'll be possible to do `DT[CJ(col1,col3,unique=TRUE), on=c("col1","col3")]`. If you load tidyr, `setDT(complete(DT,col1,col3))` is an option.. it might be less efficient. – Frank Nov 06 '15 at 22:42
  • @Frank this is great. What if I want to supply the max parameter (which is 3 in this case) externally? That is, I want to say if the number of rows for say `c` is less than 3 then only insert the remaining rows (2 rows in this case). Or if the number of rows is less than 2, then only insert the remaining rows (1 in this case)? How can I do this? – user3664020 Nov 06 '15 at 22:51
  • I don't really understand the question. Maybe you should post it as a new question with (since you have two answers already)..? – Frank Nov 06 '15 at 23:28
  • This is a duplicate of at least 2-3 questions that I've seen in the last month or two. – alexwhitworth Nov 07 '15 at 00:28
  • Possible duplicate of [how to insert missing observations on a data frame](http://stackoverflow.com/questions/33003819/how-to-insert-missing-observations-on-a-data-frame) – alexwhitworth Nov 07 '15 at 00:29
  • Duplicate #2: http://stackoverflow.com/questions/32979865/easiest-way-to-add-missing-rows-in-r? – alexwhitworth Nov 07 '15 at 00:30
  • @Alex Yes, I bet this has been asked before, but no, I don't think either of those are good dupe targets. Try this one instead, maybe: http://stackoverflow.com/q/27944002/1191259 or the one Colonel Beauvel links from there: http://stackoverflow.com/q/26591748/1191259 or this one: http://stackoverflow.com/q/10954602/1191259 None of those specify that they want to know what to do with a data.table specifically, though. (That may or may not be enough reason for not duping.) – Frank Nov 07 '15 at 04:34

3 Answers3

3

data.table I think the idiom is a merge with the Cartesian/cross product of col1 and col3

(as also in @Jealie and @PLapointe's answers):

DT[CJ(col1 = col1, col3 = col3, unique=TRUE), on = c("col1", "col3")]

   col1 col2 col3
1:    a  123    1
2:    a  433    2
3:    a  322    3
4:    b   43    1
5:    b 4333    2
6:    b   NA    3
7:    c   43    1
8:    c   NA    2
9:    c   NA    3

CJ constructs the Cartesian product, and A[B,on=cols] does a merge with all rows of B in the result.


tidyr Outside of data.table, another alternative with nicer syntax is in tidyr:

library(tidyr)
complete(DT, col1, col3)

Unfortunately, this doesn't return a data.table. You can use setDT on the result to fix that.

Frank
  • 66,179
  • 8
  • 96
  • 180
  • My task requires me to delete these inserted artificial rows later. How do I do that? – user3664020 Nov 08 '15 at 18:39
  • @user3664020 Hm, you could do something like `res = DT[, is_artificial := FALSE][CJ(col1 = col1, col3 = col3, unique=TRUE), on = c("col1", "col3")][is.na(is_artificial), is_artificial := TRUE]`. Then you can filter on that column later. – Frank Nov 08 '15 at 18:43
1

I would go with an application of merge (available for both data.table and data.frame).

Let's start by creating a pattern of what we would like to have:

> pattern = data.frame(col1=rep(letters[1:3], each=3), col3=rep(1:3,3))
> pattern
  col1 col3
1    a    1
2    a    2
3    a    3
4    b    1
5    b    2
6    b    3
7    c    1
8    c    2
9    c    3

And then merge this pattern with the real data:

> merge(pattern, real_data, all.x=T, by=c('col1','col3'))
  col1 col3 col2
1    a    1  123
2    a    2  433
3    a    3  322
4    b    1   43
5    b    2 4333
6    b    3   NA
7    c    1   43
8    c    2   NA
9    c    3   NA

NB: real_data was obtained here with:

# data.table:
real_data = structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"), col2 = c(123L, 433L, 322L, 43L, 4333L, 43L), col3 = c(1L, 2L, 3L, 1L, 2L, 1L)), .Names = c("col1", "col2", "col3"), class = c("data.table","data.frame"), row.names = c(NA, -6L))
# or data.frame:
real_data = structure(list(col1 = structure(c(1L, 1L, 1L, 2L, 2L, 3L), .Label = c("a", "b", "c"), class = "factor"), col2 = c(123L, 433L, 322L, 43L, 4333L, 43L), col3 = c(1L, 2L, 3L, 1L, 2L, 1L)), .Names = c("col1", "col2", "col3"), class = "data.frame", row.names = c(NA, -6L))
Jealie
  • 6,157
  • 2
  • 33
  • 36
1

Or you could use full_join in dplyr:

table1 <-read.table(text="col1   col2    col3
 a     123      1
 a     433      2
 a     322      3
 b     43       1
 b     4333     2
 c     43       1", header=T,stringsAsFactors =F)

library(dplyr)
all1 <-expand.grid(letters[1:3],1:3, stringsAsFactors = F)
colnames(all1) <-c("col1","col3")
full_join(table1,all1,c("col1","col3"))

  col1 col2 col3
1    a  123    1
2    a  433    2
3    a  322    3
4    b   43    1
5    b 4333    2
6    c   43    1
7    c   NA    2
8    b   NA    3
9    c   NA    3
Pierre Lapointe
  • 16,017
  • 2
  • 43
  • 56