2

My question is - in pyspark dataframe in "rlike" function how to pass the string value row by row from one of dataframe column

Got the error meesagge while run

    df.withColumn("match_str", df.text1.rlike(df.match)).show(truncate=False)

        Py4JError: An error occurred while calling o2165.rlike. Trace:
        py4j.Py4JException: Method rlike([class org.apache.spark.sql.Column]) does not exist

Do you know any workaround or solution ?

    df = spark.createDataFrame([
        (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0'),
        (2, 'test2 test2_0|test1 test0', None),
        (3, 'Nan', 5.2, 23, 'Nan'),
        (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0'),
       ], ['id', 'match', 'text1'])



    +---+-------------------------+-----------------------+
    |id |match                    |text1                  |
    +---+-------------------------+-----------------------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|
    |2  |test2 test2_0|test1 test0|null                   |
    |3  |Nan                      |Nan                    |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|
    +---+-------------------------+-----------------------+

    root
     |-- id: long (nullable = true)
     |-- match: string (nullable = true)
     |-- text1: string (nullable = true)


    df.withColumn("match_str", df.text1.rlike(df.select(df.match).head()["match"])).show(truncate=False)

Note : df.select(df.match).head()["match"] passing value first row match in this case matching "test1 test1_0|test1 test0" to all rows. I want to pass the rlike value row by row. like

  1. id '1' match 'test1 test1_0|test1 test0' with "This is a test1 test1_0"
  2. id '2' match 'test2 test2_0|test1 test0' with "None"

etc.

    +---+-------------------------+-----------------------+---------+
    |id |match                    |text1                  |match_str|
    +---+-------------------------+-----------------------+---------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|true     |
    |2  |test2 test2_0|test1 test0|null                   |null     |
    |3  |Nan                      |Nan                    |false    |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|false    |
    +---+-------------------------+-----------------------+---------+

    df.withColumn("match_str", df.text1.rlike(df.match)).show(truncate=False)

        Py4JError: An error occurred while calling o2165.rlike. Trace:
        py4j.Py4JException: Method rlike([class org.apache.spark.sql.Column]) does not exist

expected results :

    +---+-------------------------+-----------------------+---------+
    |id |match                    |text1                  |match_str|
    +---+-------------------------+-----------------------+---------+
    |1  |test1 test1_0|test1 test0|This is a test1 test1_0|true     |
    |2  |test2 test2_0|test1 test0|null                   |false    |
    |3  |Nan                      |Nan                    |true     |
    |4  |test4 test4_0|test1 test0|This is a test4 test4_0|true     |
    +---+-------------------------+-----------------------+---------+
pault
  • 41,343
  • 15
  • 107
  • 149
sonu
  • 21
  • 2
  • 4
  • check: `df.selectExpr('match', 'text1', 'text1 rlike match').show(truncate=False)` – jxc Oct 02 '19 at 15:45

1 Answers1

2

pyspark.sql.Column.rlike() method unfortunately takes only text patterns, not other columns as pattern (you can adjust it for your needs however using udf-s).

Quick solution for your problem is to use pyspark sql rlike (so like regular sql rlike):

>>> from pyspark.sql import *
>>> from pyspark.sql.functions import *
>>> df = sqlContext.createDataFrame([
...     (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0'),
...     (2, 'test2 test2_0|test1 test0', None),
...     (3, 'Nan', 'Nan'),
...     (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0')
...    ], ['id', 'match', 'text1'])
>>> df.select("id", "match", "text1", expr("text1 rlike concat('(', match, ')$') as match_str")).show()
+---+--------------------+--------------------+---------+
| id|               match|               text1|match_str|
+---+--------------------+--------------------+---------+
|  1|test1 test1_0|tes...|This is a test1 t...|     true|
|  2|test2 test2_0|tes...|                null|     null|
|  3|                 Nan|                 Nan|     true|
|  4|test4 test4_0|tes...|This is a test4 t...|     true|
+---+--------------------+--------------------+---------+

Just to slightly modify your example, since you are operating on strings there and "Nan" string equals other "Nan" string:

>>>
... df2 = sqlContext.createDataFrame([
...     (1, 'test1 test1_0|test1 test0', 'This is a test1 test1_0x'),
...     (2, 'test2 test2_0|test1 test0', None),
...     (3, 'NanA', 'Nan'),
...     (4, 'test4 test4_0|test1 test0', 'This is a test4 test4_0')
...    ], ['id', 'match', 'text1'])
>>>
... df2.select("id", "match", "text1", expr("text1 rlike concat('(', match, ')$') as match_str")).show()
+---+--------------------+--------------------+---------+
| id|               match|               text1|match_str|
+---+--------------------+--------------------+---------+
|  1|test1 test1_0|tes...|This is a test1 t...|    false|
|  2|test2 test2_0|tes...|                null|     null|
|  3|                NanA|                 Nan|    false|
|  4|test4 test4_0|tes...|This is a test4 t...|     true|
+---+--------------------+--------------------+---------+
Grzegorz Skibinski
  • 12,624
  • 2
  • 11
  • 34
  • can you please explain me – sonu Oct 05 '19 at 04:17
  • expr("text1 rlike concat('(', match, ')$') as match_str") in this how this actually convert into the executable command. I am also not 100% clear about the $ used. Please explain will help me a lot – sonu Oct 05 '19 at 04:18
  • Yes, so ```expr``` parses it's argument as sql query (or hive QL query, depending what you are using underneath), returning column. Then ```rlike``` is regex like, and it translates ```(a|b)$``` into match ```a``` or ```b``` at the end (indicated by: ```$```). If you would e.g. drop the square brackets it would be ```a|b$```, which would translate to: match ```a``` anywhere, or match ```b``` at the end of string. – Grzegorz Skibinski Oct 05 '19 at 07:20