When I attempt to convert the values to logical, the "<na>"
values are converted to FALSE
Surprisingly, they are not. If you inspect the result further:
spark_lgl_boolean <- spark_lgl %>% mutate_if(is.character, as.logical)
spark_lgl_boolean %>% mutate_all(is.na)
Applying predicate on the first 100 rows
# Source:   lazy query [?? x 2]
# Database: spark_connection
      a     b
  <lgl> <lgl>
1 FALSE FALSE
2 FALSE FALSE
3  TRUE  TRUE
4  TRUE  TRUE
5 FALSE FALSE
6 FALSE FALSE
This is consistent with the NA count:
spark_lgl_boolean %>%
mutate_all(is.na) %>%
mutate_all(as.numeric) %>%
summarize_all(sum)
# Source:   lazy query [?? x 2]
# Database: spark_connection
      a     b
  <dbl> <dbl>
1     2     2
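The optimizedPlan helper used below is not part of sparklyr itself; a minimal sketch, assuming we just want the optimized logical plan of the underlying Spark query, could look like this:
optimizedPlan <- function(df) {
  df %>%
    spark_dataframe() %>%         # tbl_spark -> underlying Spark DataFrame (jobj)
    invoke("queryExecution") %>%  # Spark's QueryExecution for this query
    invoke("optimizedPlan")       # the optimized logical plan
}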
Spark execution plan:
spark_lgl %>% mutate_if(is.character, as.logical) %>% optimizedPlan
Applying predicate on the first 100 rows
<jobj[1074]>
org.apache.spark.sql.catalyst.plans.logical.Project
Project [a#10, cast(b#11 as boolean) AS b#2037]
+- InMemoryRelation [a#10, b#11], true, 10000, StorageLevel(disk, memory, deserialized, 1 replicas), `test_lgl`
+- *FileScan csv [a#10,b#11] Batched: false, Format: CSV, Location: InMemoryFileIndex[file:/tmp/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<a:boolean,b:string>
and Spark's casting logic for StringType -> BooleanType, where:
- Strings TRUE / T (case insensitive) and 1 are converted to a true literal.
- Strings FALSE / F (case insensitive) and 0 are converted to a false literal.
- Strings not matched above are converted to NULL (~ NA).
scala> Seq("tRUE", "FALSE", "f", "<na>", "NA", "1", "0").toDF("x").select($"x".cast("boolean")).show
+-----+
|    x|
+-----+
| true|
|false|
|false|
| null|
| null|
| true|
|false|
+-----+
The problem seems to be introduced specifically by sparklyr's own conversions. See Improve Serialization on GitHub (credit to kevinykuo for pointing this one out).
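In other words, the values are correct on the Spark side, and it is the transfer back into R where the coercion can bite. A quick sketch of the round trip, reusing the table from above:
spark_lgl_boolean %>% collect()  # fetches into R; this serialization step is where the issue lives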
However, if you stick to Spark-based logic without fetching the data into R, things should work just fine (for example, if you write the data to a file).
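For instance, a minimal sketch that keeps the whole pipeline on the Spark side (the output path is an assumption):
spark_lgl_boolean %>%
  spark_write_csv(path = "/tmp/lgl_boolean")  # hypothetical path; written by Spark, NULLs intact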
I'm using spark_read_csv() to load the data
In that case you might take a look at the nullValue and nanValue options of the CSV reader. For example:
spark_read_csv(..., options = list(nullValue = "<na>"))
or
spark_read_csv(..., options = list(nanValue = "<na>"))
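For illustration, a complete call might look like this (the connection, table name, and path are assumptions):
spark_lgl_null <- spark_read_csv(
  sc,                                  # assumed active spark_connection
  name = "test_lgl_null",              # hypothetical table name
  path = "/tmp/lgl.csv",               # hypothetical input path
  options = list(nullValue = "<na>")   # map the literal "<na>" string to SQL NULL
)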
Please keep in mind, though, that Spark's semantics for NULL / NaN are not the same as R's NA / NaN.