I want to understand how colRegex works in PySpark. How does colRegex choose between Col1 and Col2?

df = spark.createDataFrame([("a", 1), ("b", 2), ("c",  3)], ["Col1", "Col2"])

df.select(df.colRegex("`(Col1)?+.+`")).show()

+----+
|Col2|
+----+
|   1|
|   2|
|   3|
+----+

df.select(df.colRegex("`(Col2)?+.+`")).show()

+----+
|Col1|
+----+
|   a|
|   b|
|   c|
+----+

When I use "(Col3)?+.+" in the expression above, I get both Col1 and Col2. Can you explain what is happening in the regex?

Oli

1 Answer

Your question is not really about colRegex. That function simply selects all the columns whose names match the provided regex. Since Spark is written in Scala, it relies on the Java regex implementation that Scala uses.

Indeed:

scala> "Col1".matches("(Col2)?+.+")
res28: Boolean = true
scala> "Col1".matches("(xxxxx)?+.+")
res28: Boolean = true
scala> "Col2".matches("(Col2)?+.+")
res28: Boolean = false

So the real question is why, in Java/Scala, Col2 is not matched by the regex (Col2)?+.+. The Javadoc of the Pattern class says:

Possessive quantifiers

X?+: X, once or not at all

The post Greedy vs. Reluctant vs. Possessive Qualifiers explains nicely what a possessive quantifier is. In a nutshell, it is a greedy quantifier that never gives back what it has matched, even if the rest of the pattern then fails. In this case, (Col2)?+ consumes all of Col2, leaving nothing for .+, which requires at least one character.

With ?, the usual greedy quantifier, the matcher would backtrack, let (Col2)? match nothing, and return true because Col2 matches .+.
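
The backtracking step can be observed directly. This is a minimal sketch that uses Python's re module as a stand-in for the Java engine (an assumption: the two agree on how a greedy ? behaves here), with re.fullmatch playing the role of String.matches:

```python
import re

# Greedy version of the pattern: (Col2)? is allowed to give back what it matched.
greedy = re.compile(r"(Col2)?.+")

m = greedy.fullmatch("Col2")

# The match succeeds, but only because the optional group gave up "Col2"
# so that .+ could consume it instead.
print(m is not None)  # True
print(m.group(0))     # Col2
print(m.group(1))     # None: the group took no part in the final match
```

The empty capture group is the visible trace of the backtracking: the whole string was matched by .+ alone.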

With the possessive quantifier ?+, however, it does not backtrack and the match fails there. This is why (Col2)?+.+ matches every non-empty string except Col2.
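
To make the difference concrete, here is a hand-written model of the two behaviours in plain Python. It is only a sketch: matches_possessive, matches_greedy and select are hypothetical helpers, not part of Spark or of any regex library, and they assume column names contain no line terminators (which . would not match).

```python
def matches_possessive(prefix, name):
    """Model of (prefix)?+.+ applied to the whole string."""
    # (prefix)?+ : take the prefix if it is there, and never give it back.
    rest = name[len(prefix):] if name.startswith(prefix) else name
    # .+ : at least one character must be left over.
    return len(rest) >= 1


def matches_greedy(prefix, name):
    """Model of (prefix)?.+ applied to the whole string."""
    # First attempt: identical to the possessive case.
    if matches_possessive(prefix, name):
        return True
    # Backtrack: retry with the optional group matching nothing.
    return len(name) >= 1


def select(matcher, prefix, columns):
    """Keep the column names accepted by the matcher (hypothetical helper)."""
    return [c for c in columns if matcher(prefix, c)]


columns = ["Col1", "Col2"]
print(select(matches_possessive, "Col1", columns))  # ['Col2']
print(select(matches_possessive, "Col2", columns))  # ['Col1']
print(select(matches_possessive, "Col3", columns))  # ['Col1', 'Col2']
print(select(matches_greedy, "Col2", columns))      # ['Col1', 'Col2']
```

The first three results reproduce the outputs from the question. On Python 3.11 or later, the re module accepts possessive quantifiers itself, so re.fullmatch(r"(Col2)?+.+", "Col2") can be used to check the model against a real engine.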

Oli