Your question is not really about colRegex. That function simply selects all the columns that match the provided regex. Since Spark is written in Scala, it relies on the Java regex implementation that Scala uses.
Indeed:
scala> "Col1".matches("(Col2)?+.+")
res28: Boolean = true
scala> "Col1".matches("(xxxxx)?+.+")
res28: Boolean = true
scala> "Col2".matches("(Col2)?+.+")
res28: Boolean = false
So the question is how to explain that, in Java/Scala, Col2 is not matched by the regex (Col2)?+.+. If we have a look at the javadoc of the Pattern class, we read this:
Possessive quantifiers
X?+    X, once or not at all
The post "Greedy vs. Reluctant vs. Possessive Qualifiers" explains nicely what a possessive quantifier is. In a nutshell, it is a greedy quantifier that does not backtrack when the rest of the pattern fails to match. In this case, (Col2)?+ consumes all of Col2, and then nothing is left for .+, which requires at least one character.
With ?, the usual greedy quantifier, the matcher would backtrack, let (Col2)? match nothing, and return true because Col2 matches .+.
With the possessive quantifier ?+, however, the matcher does not backtrack and the match fails there. This is why (Col2)?+.+ matches every non-empty string except Col2.
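To see the two quantifiers side by side, here is a minimal sketch in plain Scala. It does not use Spark: the object QuantifierDemo, the column list and the select helper are hypothetical names made up for illustration, and the filter only stands in for what selecting columns by regex amounts to.

```scala
object QuantifierDemo {
  // Possessive optional group: once "Col2" is consumed, it is never given back.
  val possessive = "(Col2)?+.+"
  // Greedy optional group: the matcher may backtrack and let .+ take "Col2".
  val greedy = "(Col2)?.+"

  // Hypothetical column names, standing in for the columns of a DataFrame.
  val columns = Seq("Col1", "Col2", "Col2x", "Col3")

  // Keep only the names that the given regex matches in full.
  def select(regex: String): Seq[String] = columns.filter(_.matches(regex))

  def main(args: Array[String]): Unit = {
    println(select(possessive)) // List(Col1, Col2x, Col3)
    println(select(greedy))     // List(Col1, Col2, Col2x, Col3)
  }
}
```

Note that Col2x is still selected by the possessive version: (Col2)?+ consumes Col2 and .+ then matches the remaining x. Only the exact name Col2 is excluded.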