
How can I use existing PySpark SQL functions to find non-consuming regular expression patterns in a string column?

The following is reproducible, but does not give the desired results.

import pyspark
from pyspark.sql import (
    SparkSession,
    functions as F)

spark = (SparkSession.builder
         .master('yarn')
         .appName("regex")
         .getOrCreate()
         )

sc = spark.sparkContext
sc.version # u'2.2.0'

testdf = spark.createDataFrame([
    (1, "Julie", "CEO"),
    (2, "Janice", "CFO"),
    (3, "Jake", "CTO")],
    ["ID", "Name", "Title"])
ptrn = '(?=Ja)(?=ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|      false|
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_extract(F.col('Name'), ptrn, 1)).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|           |
|  2|Janice|  CFO|           |
|  3|  Jake|  CTO|           |
+---+------+-----+-----------+
testdf.withColumn('contns_ptrn', F.regexp_replace(F.col('Name'), ptrn, '')).show()
+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      Julie|
|  2|Janice|  CFO|     Janice|
|  3|  Jake|  CTO|       Jake|
+---+------+-----+-----------+

The desired results would be:

+---+------+-----+-----------+
| ID|  Name|Title|contns_ptrn|
+---+------+-----+-----------+
|  1| Julie|  CEO|      false|
|  2|Janice|  CFO|      false|
|  3|  Jake|  CTO|       true|
+---+------+-----+-----------+

This is because the Name value in the third row contains both 'Ja' and 'ke'.

If regexp_extract or regexp_replace can handle such non-consuming (lookahead) patterns, then I could also combine them with length to get a Boolean column.
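
For illustration, the kind of length-based check I have in mind would be something like this (just a sketch; it only helps if the replace actually consumes the matched text):

testdf.withColumn(
    'contns_ptrn',
    F.length(F.regexp_replace(F.col('Name'), ptrn, '')) < F.length(F.col('Name'))
).show()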

1 Answer


I found a quick solution; hopefully it can help someone else.

Change ptrn from '(?=Ja)(?=ke)' to '(?=.*Ja)(?=.*ke)' and rlike works, as shown below.
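
For example, with the same DataFrame as above, the check becomes something like:

ptrn = '(?=.*Ja)(?=.*ke)'
testdf.withColumn('contns_ptrn', testdf.Name.rlike(ptrn)).show()

This gives the desired output shown in the question.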

This answer got me close, but led to my problem: https://stackoverflow.com/a/469951/5060792

These answers solved my problem: https://stackoverflow.com/a/3041326 and https://stackoverflow.com/a/470602/5060792

By the way, with nothing but the change to ptrn, regexp_extract throws java.lang.IndexOutOfBoundsException: No group 1. After wrapping the entire pattern in parentheses, ptrn = '((?=.*Ja)(?=.*ke))', it returns nulls.
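
Roughly what I tried (a sketch of the two variants just described):

# Throws java.lang.IndexOutOfBoundsException: No group 1
testdf.withColumn('contns_ptrn',
                  F.regexp_extract(F.col('Name'), '(?=.*Ja)(?=.*ke)', 1)).show()

# Wrapping the whole pattern in parentheses avoids the exception,
# but the captured group is zero-width, so nothing useful comes back
testdf.withColumn('contns_ptrn',
                  F.regexp_extract(F.col('Name'), '((?=.*Ja)(?=.*ke))', 1)).show()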

Again, regexp_replace replaces nothing and the original values are returned.

  • There is no such thing as _a non-consuming regular expression_. There is no documentation that lists this as anything at all. –  May 17 '19 at 00:11
  • I got that terminology from https://stackoverflow.com/a/469951/5060792. It is the same as a 'positive lookahead', and several references say something along the lines of: positive lookahead patterns do 'not consume any characters or expand the match.' So: do not consume, non-consuming, and the included `(?='pattern')` - clear enough. – Clay May 17 '19 at 02:03
  • No, not really. To be clear, _assertions do not consume characters_; it is an attribute of those constructs, which include BOS, EOS, anchors, negative/positive lookbehinds, word boundaries and others: `^$\A\Z\z\G\b\B(?=)(?<=)(?!)(?<!)` etc. It's a stretch to think _a non-consuming regular expression_ has any specific quantifiable meaning, get my drift? –  May 18 '19 at 01:58