
I am running PySpark v1.6.0 and I have a column of string values (according to .printSchema). When I try to filter the rows where the column value starts with a "[" character or contains a "," character, rows that I expect to evaluate to True are coming back as False...

When I run the code:

col_name = "attempt_params_attempt_response_id"
resultDF.select(col_name, resultDF[col_name].like(",")).show(50)

I get:

[Screenshot: the like(",") column shows false for rows whose value clearly contains a comma]

I don't understand how this is possible because the string value clearly contains a comma, so that row should return true, not false.

Similarly, when I try casting the rows to ArrayType(StringType()) (which is my ultimate goal), it also behaves as if my rows don't contain a comma...

When I run the code:

from pyspark.sql.functions import split
from pyspark.sql.types import ArrayType, IntegerType, StringType

col_name = "attempt_params_attempt_response_id"
resultDF.withColumn(col_name,
                    split(resultDF[col_name], ",\s*")
                    .cast(ArrayType(StringType()))).select(col_name).show(40)

I get the results:

[Screenshot: each row shows the original bracketed string as the sole element of the array, as if there were no comma to split on]

I wonder if perhaps there's some sort of bizarre encoding issue that causes the literal , in my code not to match the character that appears to be a , in the data... but I'm really not sure. Any ideas on why this is happening and how I can actually get the cast to work without producing the text of a multi-dimensional array?

devinbost

1 Answer


In the first case your pattern is wrong. like is equivalent to SQL LIKE, which uses simple pattern matching (% and _ wildcards), not regular expressions, so the pattern , matches only a value that is exactly ",":

from pyspark.sql.functions import col, regexp_replace, split

df = spark.createDataFrame([("[0,2,3]", ), (",", )], ("text", ))
df.withColumn("contains_comma", col("text").like(",")).show()
+-------+--------------+
|   text|contains_comma|
+-------+--------------+
|[0,2,3]|         false|
|      ,|          true|
+-------+--------------+

To get a match you should add leading and trailing wildcards:

df.withColumn("contains_comma", col("text").like("%,%")).show()


# +-------+--------------+
# |   text|contains_comma|
# +-------+--------------+
# |[0,2,3]|          true|
# |      ,|          true|
# +-------+--------------+

In the second case there is no problem at all. Since you split only on the comma, the first item keeps the leading [

df.withColumn("contains_comma", split("text", ",\s*")[0]).show()
+-------+--------------+
|   text|contains_comma|
+-------+--------------+
|[0,2,3]|            [0|
|      ,|              |
+-------+--------------+

and the last item keeps the trailing ]. You should strip these if you don't want them in the output, for example using regexp_replace:

split(regexp_replace("text", r"^\[|\]$", ""), ",")
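
Putting the pieces together, a minimal sketch of the cast the question is after (reusing resultDF and the column name from the question; note that split already returns an array<string>, so the explicit cast to ArrayType(StringType()) is kept only to match the stated goal and would matter more for other element types):

from pyspark.sql.functions import col, regexp_replace, split
from pyspark.sql.types import ArrayType, StringType

col_name = "attempt_params_attempt_response_id"

resultDF = resultDF.withColumn(
    col_name,
    split(
        regexp_replace(col(col_name), r"^\[|\]$", ""),  # strip the surrounding brackets first
        r",\s*"                                         # then split on commas
    ).cast(ArrayType(StringType()))                     # split already yields array<string>
)
resultDF.select(col_name).show(10, truncate=False)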
zero323
  • I also tried .startsWith("["), and those cases were all false too. Do I also need to add a wildcard for that to work? – devinbost Sep 04 '18 at 02:13
  • For `startsWith`, no. It uses a fixed value as the pattern. However, without a [reproducible example](https://stackoverflow.com/q/48427185/6910411) it isn't possible to say what the problem is, and it doesn't seem to be directly related to the ones you have described here. – zero323 Sep 04 '18 at 12:58 (see the sketch after these comments)
  • I was able to get your example to work after modifying your regex pattern to `"^\[|\]$"` – devinbost Sep 04 '18 at 16:29
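
A minimal sketch of the startsWith point raised in the comments, using the toy df from the answer (in PySpark the method is spelled startswith, and it matches a literal prefix, so no wildcard is needed):

df.withColumn("starts_with_bracket", col("text").startswith("[")).show()

# +-------+-------------------+
# |   text|starts_with_bracket|
# +-------+-------------------+
# |[0,2,3]|               true|
# |      ,|              false|
# +-------+-------------------+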