
I'm running some code, the relevant essence of which is:

library(SparkR)
library(magrittr)
sqlContext %>% sql("select * from tmp") %>%
  # group by id; each group arrives in the function as a local data.frame x
  gapply("id", function(key, x) {
    data.frame(
      id = key,
      n = nrow(x)  # row count for this id
    )
  }, schema = structType(
    structField("id", "integer"),
    structField("n", "integer")
  ))
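For reference, all I expect from this is a per-id row count, i.e. what a local per-group count would give. A toy sketch with data.table (tmp_local and its contents are purely illustrative):

library(data.table)
tmp_local = data.table(id = c(1L, 1L, 2L))
tmp_local[ , .(n = .N), keyby = id]
#    id n
# 1:  1 2
# 2:  2 1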

Unfortunately, for some values of id, nrow is computed incorrectly. For comparison, on a subset of the data, I ran:

library(data.table)
# local ground truth: collect a subset and count rows per id with data.table
tmp = sqlContext %>% sql('select * from tmp where id < 1000') %>% collect %>% setDT

and then, with gapply_df being the collected result of the gapply call above, ran:

# rows where gapply's count n exceeds the local ground-truth count N
gapply_df[tmp[ , .N, keyby = id], on = 'id'][N < n]
#      n   N
# 1: 276 138
# 2: 148  74
# 3: 122  61
# 4: 303 101
# 5: 266 133

I note that the n produced by gapply (n on the left) is sometimes a multiple (here 2x or 3x) of what is actually correct (N on the right).

What could be causing this, and how can it be fixed? I'm worried that nrow is in fact giving the right answer (it is, after all, called on a local data.frame), and that my data has been duplicated/triplicated somewhere along the way, meaning the rest of my analysis could be wrong as well.
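One narrowing check I have in mind (just a sketch; the names sql_counts and n_sql are illustrative, and gapply_df is the collected gapply result as above) is to compute the per-id counts directly in Spark SQL and join them against the gapply output; if those counts are also inflated, the duplication exists upstream of gapply rather than inside it:

sql_counts = sqlContext %>%
  sql('select id, count(*) as n_sql from tmp group by id') %>%
  collect %>% setDT
# any mismatch here would mean plain Spark SQL already sees the inflated counts
gapply_df[sql_counts, on = 'id'][n != n_sql]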

Sorry I can't provide a reproducible example; here's my sessionInfo():

# R version 3.4.1 (2017-06-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.5 LTS
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.0
# locale:
# [1] C
# attached base packages:
# [1] stats     graphics  grDevices utils     datasets  methods   base     
# other attached packages:
# [1] data.table_1.10.4 magrittr_1.5      knitr_1.16        SparkR_2.1.1     
# loaded via a namespace (and not attached):
# [1] compiler_3.4.1     markdown_0.8       tools_3.4.1
# [4] KernSmooth_2.23-15 stringi_1.1.5      highr_0.6
# [7] stringr_1.2.0      mime_0.5           evaluate_0.10.1  

Running in Zeppelin with Spark 2.1.1.

MichaelChirico