I am running PySpark v1.6.0 and have a column of string values (according to .printSchema). When I attempt to filter the rows to those where the column value starts with a "[" character or contains a "," character, in both cases the rows that I expect to evaluate to True still come back as False...
When I run the code:
col_name = "attempt_params_attempt_response_id"
resultDF.select(col_name, resultDF[col_name].like(",")).show(50)
I get:
I don't understand how this is possible, because the string value clearly contains a comma, so that row should return true, not false.
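For reference, here is a minimal, self-contained sketch of the same select. The two rows are made-up placeholder values (my real data comes from a larger pipeline), but the column name and the like() call are exactly what I'm using:
# Minimal repro sketch -- the rows below are placeholder values, not my real data
from pyspark import SparkContext
from pyspark.sql import SQLContext

sc = SparkContext("local", "like-repro")
sqlContext = SQLContext(sc)

col_name = "attempt_params_attempt_response_id"
df = sqlContext.createDataFrame([("[1, 2, 3]",), ("42",)], [col_name])

# The first row clearly contains commas, yet the like(",") column shows false for it
df.select(col_name, df[col_name].like(",")).show()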
Similarly, when I try casting the rows to ArrayType(StringType()) (which is my ultimate goal), it also behaves as if my rows don't contain a comma...
When I run the code:
from pyspark.sql.functions import split
from pyspark.sql.types import ArrayType, IntegerType, StringType

col_name = "attempt_params_attempt_response_id"
resultDF.withColumn(col_name,
                    split(resultDF[col_name], r",\s*")
                    .cast(ArrayType(StringType()))).select(col_name).show(40)
I get the results:
I wonder if perhaps there's some sort of bizarre encoding issue that causes the "," character in my pattern not to match what appears to be a "," character in the data... but I'm really not sure. Any ideas on why this is happening, and how I can actually get the cast to work without just producing the text of a multi-dimensional array?
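In case it helps with reproducing, here is the cast attempt as a runnable sketch, reusing the placeholder sqlContext and made-up data from the sketch above (only the column name and the transformation are the real ones from my job):
# Same placeholder data, exercising the split + cast path described above
from pyspark.sql.functions import split
from pyspark.sql.types import ArrayType, StringType

col_name = "attempt_params_attempt_response_id"
df = sqlContext.createDataFrame([("[1, 2, 3]",), ("42",)], [col_name])

# Split on commas (plus optional whitespace) and cast the result to array<string>
arrayDF = df.withColumn(col_name,
                        split(df[col_name], r",\s*")
                        .cast(ArrayType(StringType())))
arrayDF.select(col_name).show(40)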