While analysing some data, I came across the following warning message. I suspect it is a bug, since it comes from a straightforward command that I have used many times.

Warning message:
In rbindlist(allargs) : NAs introduced by coercion

I was able to reproduce the warning. The following code should reproduce it:

library(data.table)

# unique random names for column V1
set.seed(45)
n <- sapply(1:500, function(x) {
    paste(sample(c(letters[1:26]), 10), collapse="")
})
# generate some values for V2 and V3
dt <- data.table(V1 = sample(n, 30*500, replace = TRUE), 
                 V2 = sample(1:10, 30*500, replace = TRUE), 
                 V3 = sample(50:100, 30*500, replace = TRUE))
setkey(dt, "V1")

# No warning when providing column names (and right results)
dt[, list(s = sum(V2), m = mean(V3)),by=V1]

#              V1   s        m
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  93 77.95000
# 497: zwpserimgf 178 72.62963
# 498: zxkpdrlcsf 154 78.04167
# 499: zxvoaeflhq 121 75.34615
# 500: zyiwcsanlm 180 76.61290

# Warning message and results with NA
dt[, list(sum(V2), mean(V3)),by=V1]

#              V1  V1       V2
#   1: acgmqyuwpe 238 74.97778
#   2: adcltygwsq 204 79.94118
#   3: adftozibnh 165 75.51515
#   4: aeuowtlskr 164 75.70968
#   5: ahfoqclkpg 192 73.20000
#  ---                        
# 496: zuqegoxkpi  NA 77.95000
# 497: zwpserimgf  NA 72.62963
# 498: zxkpdrlcsf  NA 78.04167
# 499: zxvoaeflhq  NA 75.34615
# 500: zyiwcsanlm  NA 76.61290

Warning message:
In rbindlist(allargs) : NAs introduced by coercion

  • 1) This seems to happen only if you don't provide the column names.

  • 2) Even then, it seems to happen only when V1 (or whichever column you use in by=) has many unique entries (500 here). It does NOT happen when the by= column has fewer unique entries. For example, change the code for n from sapply(1:500, ... to sapply(1:50, ... and you'll get no warning.

What's going on here? This is R version 2.15.0 on a MacBook Pro with OS X 10.8.2 (although I also tested it on another MacBook Pro with R 2.15.2). Here's the sessionInfo().

> sessionInfo()
R version 2.15.0 (2012-03-30)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6 reshape2_1.2.2  

loaded via a namespace (and not attached):
[1] plyr_1.8      stringr_0.6.2 tools_2.15.0 

Just reproduced with 2.15.2:

> sessionInfo()
R version 2.15.2 (2012-10-26)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] data.table_1.8.6
Arun
  • Same thing here with R 2.15.2 under Linux, data.table 1.8.6. – juba Jan 29 '13 at 14:00
  • @juba, Thank you. Yes, I also confirmed it by updating my R version. – Arun Jan 29 '13 at 14:01
  • Is it not a problem with the `V1` column name? If I rename V1 to V in `dt`, the problem seems to go away. And I notice that there are two columns named `V1` in the resulting dataset that raises the warning. – juba Jan 29 '13 at 14:04
  • Yes, but that shouldn't be the reason for `NA`. Check with a smaller `data.table` and it will work. – Arun Jan 29 '13 at 14:04
  • @matthew, any idea about this error? I haven't checked whether this has already been filed as a bug (or possibly fixed in the current development version). – Arun Jan 29 '13 at 14:05
  • Does it happen in v1.8.7? Btw, congratulations for asking the 500th question tagged `data.table`! – Matt Dowle Jan 29 '13 at 14:05
  • Oh thank you :)! I am not able to install `1.8.7` on my Mac. I tried `install.packages("data.table", repos="http://R-Forge.R-project.org")` and it gives me back: `package ‘data.table’ is not available (for R version 2.15.2)`. And when I try with `devtools` it ends up with `sh: make: command not found`. – Arun Jan 29 '13 at 14:15
  • That's odd. You have 2.15.2 and R-Forge builds with 2.15.2 so I'm baffled by that message. Unless R-Forge is in the process of building (I did commit last night), but it displays "Current" at rev 800, although the last commit was 801. For devtools, does _any_ package work with devtools or is that the first time you tried devtools at all? – Matt Dowle Jan 29 '13 at 14:20
  • @MatthewDowle, I've used devtools and it has worked before, although I never tried on `data.table`. – Arun Jan 29 '13 at 14:25
  • R-Forge is likely building then (odd that it removes the package first). Anyway, I see the bug in 1.8.7 too. Answer on way ... – Matt Dowle Jan 29 '13 at 15:05

1 Answer

UPDATE: Now fixed in v1.8.9 by Ricardo.

o rbind'ing data.tables containing duplicate, "" or NA column names now works, #2726 & #2384. Thanks to Garrett See and Arun Srinivasan for reporting. This also affected the printing of data.tables with duplicate column names since the head and tail are rbind-ed together internally.


Yes, bug. Seems to be in the print method of data.tables with duplicated names.

ans = dt[, list(sum(V2), mean(V3)),by=V1]
head(ans)
           V1  V1       V2     # notice the duplicated V1
1: acgmqyuwpe 140 78.07692
2: adcltygwsq 191 76.93333
3: adftozibnh 153 73.82143
4: aeuowtlskr 122 73.04348
5: ahfoqclkpg 143 75.83333
6: ahtczyuipw 135 73.54167
tail(ans)
           V1  V1       V2
1: zugrnehpmq 189 72.63889
2: zuqegoxkpi 150 76.03333
3: zwpserimgf 180 74.81818
4: zxkpdrlcsf 115 72.57895
5: zxvoaeflhq 157 76.53571
6: zyiwcsanlm 145 72.79167
print(ans)
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion
rbind(head(ans),tail(ans))
Error in rbindlist(allargs) : 
    (converted from warning) NAs introduced by coercion

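As a workaround, don't create a data.table with column names V1, V2, etc. These clash with the default names that data.table gives to unnamed results in j, which is how the duplicated V1 arises.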
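A minimal sketch of the other side of that workaround, assuming you cannot rename the input columns: name the aggregates explicitly in j, as in the first call in the question. The small table and the names `s` and `m` are only illustrative.

```r
library(data.table)

# Small illustrative table; the values are made up for this sketch.
dt <- data.table(V1 = rep(letters[1:3], each = 3),
                 V2 = 1:9,
                 V3 = 9:1)

# Naming the results in j means data.table does not have to fall back
# on its default V1, V2, ... names, so nothing clashes with the by= column.
named <- dt[, list(s = sum(V2), m = mean(V3)), by = V1]

names(named)                 # the by= column followed by the names given in j
anyDuplicated(names(named))  # 0 means there are no duplicated column names
```

With unique names in the result, printing no longer depends on how rbind handles duplicated column names.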

It's arising due to this known bug:

#2384 rbind of tables containing duplicate column names doesn't bind correctly

and I've added a link there back to this question.

Thanks!

Matt Dowle
  • If you do the same on a small `data.table`: `dt <- data.table(V1=rep(letters[1:3], each=3), V2=sample(9), V3 = sample(9))`, the `print` doesn't produce errors. I am wondering why? – Arun Jan 29 '13 at 15:16
  • @Arun Because only when `dt` is more than 100 rows (by default) does `print(dt)` print the top and bottom by `rbind`'ing together the `head` and `tail`. – Matt Dowle Jan 29 '13 at 15:38