0

I am having trouble with a reasonable sized data.table containing 30 or so columns: (note I am using dummy values below)

Using rbindlist(list(dat, dat2)) to add a new data.table with same fields with another 50000 rows produces an incorrect new master data.table.

Is there a simple and fast solution to add new rows to a data.table where the column fields all match?

To simplify, I have created a dummy dataset.

master.df <- data.frame(id = letters[1:10], 
                    mpg = sample(c(20,22), 10, replace = TRUE),
                    cyl = sample(c(4,8), 10, replace = TRUE),
                    disp = sample(c(160,300), 10, replace = TRUE),
                    factor = sample(c(TRUE, FALSE), 10, replace = TRUE),   
                    hp = sample(c(20,22), 10, replace = TRUE))

newTable.df <- data.frame(id = letters[11:15], 
                        mpg = sample(c(20,22), 5, replace = TRUE),
                        cyl = sample(c(4,8), 5, replace = TRUE),
                        disp = sample(c(160,300), 5, replace = TRUE),
                        factor = sample(c(TRUE, FALSE), 10, replace = TRUE), 
                        hp = sample(c(20,22), 5, replace = TRUE))

library(data.table)

dat = as.data.table(master.df)
dat2 = as.data.table(newTable.df)

Using rbind(dat,dat2) outputs duplicate dat2. (expected should be total 15 rows)

I read forums for better solutions and something came up with rbindlist but that does not look like it does the trick either. Same output as rbind

Is there a fast solution that binds dat2 to dat without the duplication?

output for rbind and rbindlist

    id mpg cyl disp factor hp
 1:  a  22   8  300  FALSE 20
 2:  b  20   8  300   TRUE 20
 3:  c  20   8  160  FALSE 20
 4:  d  20   4  300   TRUE 22
 5:  e  22   4  160  FALSE 22
 6:  f  22   4  160   TRUE 22
 7:  g  20   8  160  FALSE 20
 8:  h  22   4  300  FALSE 20
 9:  i  22   4  160  FALSE 20
10:  j  22   8  160   TRUE 22
11:  k  22   8  160  FALSE 20
12:  l  22   8  160   TRUE 20
13:  m  20   8  300   TRUE 20
14:  n  22   4  300  FALSE 20
15:  o  20   8  160  FALSE 20
16:  k  22   8  160  FALSE 20
17:  l  22   8  160  FALSE 20
18:  m  20   8  300  FALSE 20
19:  n  22   4  300   TRUE 20
20:  o  20   8  160   TRUE 20
Community
  • 1
  • 1
digdeep
  • 624
  • 2
  • 10
  • 21
  • [how-to-make-a-great-r-reproducible-example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) – eddi Jul 24 '13 at 14:46
  • 1
    Hi there. It would be helpful if you could explain what you are seeing, what you expect to see, etc. – Ricardo Saporta Jul 24 '13 at 14:49
  • 1
    I don't understand why you don't use `rbind(dat,dat2)`. – Roland Jul 24 '13 at 14:51
  • Thanks for the link eddi, useful. Hey Roland, I tried rbind first but it outputs duplicate values. – digdeep Jul 24 '13 at 15:29
  • Hey Ricardo, I have updated the post to be a bit more contextual :) – digdeep Jul 24 '13 at 15:30
  • 3
    The duplication is being caused by the line `factor = sample(c(TRUE, FALSE), 10, replace = TRUE)` in your creation of `newTable.df`. Change the `10` to a `5`, and all should be well. – Jean V. Adams Jul 24 '13 at 15:32
  • oops, good one, bad coding by me, currently working in the wee hours to replicate an issue with dummy data without having to use the real data set. My real dataset uses actual values by rows which is 5000 new line items going into a dataset with already 500000 rows but interrogating the data, the values from the bind are completely wacky – digdeep Jul 24 '13 at 15:37

2 Answers2

3

Try using unique:

 unique(rbind(dat1, dat2))
Ricardo Saporta
  • 54,400
  • 17
  • 144
  • 178
1

Your problem is that when creating newTable.df you have the following line:

factor = sample(c(TRUE, FALSE), 10, replace = TRUE)

This causes the resulting table to have 10 rows (instead of 5 as you intended). Once you change this 10 into 5, the dat2 data.table will have 5 rows, and rbind(dat, dat2) will have 15 rows.

Pasha
  • 6,298
  • 2
  • 22
  • 34