I'm running some code, the relevant essence of which is:
library(SparkR)
library(magrittr)

sqlContext %>% sql("select * from tmp") %>%
  gapply("id", function(key, x) {
    data.frame(
      id = key,
      n  = nrow(x)
    )
  }, schema = structType(
    structField("id", "integer"),
    structField("n", "integer")
  ))
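For reference, since this gapply call is really just a per-id row count, a Spark-side cross-check I could run is the equivalent groupBy/count (a minimal sketch, not the code I actually ran; it assumes the same tmp table, and the count column comes back named "count"):

# sketch: same per-id row count done entirely on the Spark side, bypassing gapply
sqlContext %>% sql("select * from tmp") %>%
  groupBy("id") %>%
  count() %>%
  collect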
Unfortunately, for some values of id, nrow has been calculated incorrectly. To check, I ran the following on a subset of the data:
library(data.table)
tmp = sqlContext %>% sql('select * from tmp where id < 1000') %>% collect %>% setDT
And then ran (where gapply_df is the collected result of the gapply command above):
gapply_df[tmp[ , .N, keyby = id], on = 'id'][N < n]
#      n   N
# 1: 276 138
# 2: 148  74
# 3: 122  61
# 4: 303 101
# 5: 266 133
I note that the n produced by gapply (n, on the left) is sometimes a multiple (here 2x or 3x) of what is actually correct (N, on the right).
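To make the "multiple" observation concrete, this is the kind of check I have in mind, reusing the objects above (a sketch; ratio is just an illustrative column name):

# sketch: for the mismatching ids, how large is the inflation factor n / N?
gapply_df[tmp[ , .N, keyby = id], on = 'id'][N < n, .(id, n, N, ratio = n / N)]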
What could be causing this, and how can it be fixed? My worry is that nrow is in fact giving the right answer (after all, it is called on a local data.frame), and that the data itself has been duplicated or triplicated on the way in, meaning the rest of my analysis could be wrong as well.
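To probe that duplication theory directly, these are the checks I'm considering (sketches only, not yet run; n_dup is just an illustrative column name):

# sketch: count fully duplicated rows within each group, as seen by the UDF itself
sqlContext %>% sql("select * from tmp") %>%
  gapply("id", function(key, x) {
    data.frame(id = key, n = nrow(x), n_dup = sum(duplicated(x)))
  }, schema = structType(
    structField("id", "integer"),
    structField("n", "integer"),
    structField("n_dup", "integer")
  ))

# sketch: look for fully duplicated rows in the locally collected subset
tmp[ , .N, by = names(tmp)][N > 1]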
Sorry I can't provide a reproducible example; here's my sessionInfo():
# R version 3.4.1 (2017-06-30)
# Platform: x86_64-pc-linux-gnu (64-bit)
# Running under: Ubuntu 14.04.5 LTS
# Matrix products: default
# BLAS: /usr/lib/libblas/libblas.so.3.0
# LAPACK: /usr/lib/lapack/liblapack.so.3.0
# locale:
# [1] C
# attached base packages:
# [1] stats graphics grDevices utils datasets methods base
# other attached packages:
# [1] data.table_1.10.4 magrittr_1.5 knitr_1.16 SparkR_2.1.1
# loaded via a namespace (and not attached):
# [1] compiler_3.4.1 markdown_0.8 tools_3.4.1
# [4] KernSmooth_2.23-15 stringi_1.1.5 highr_0.6
# [7] stringr_1.2.0 mime_0.5 evaluate_0.10.1
Running in Zeppelin with Spark 2.1.1.