How can I query where column exists in another column?

Question

I have a dataframe that contains column with page names and another column which contains Json with page list. I would like to check if the page name appears in the page list and filter it if it doesn't.

How can I do it?

df for example:

+---------+--------------------------------+
|page name|page_list                       |
+---------+--------------------------------+
|home     |{page_list:['home','something']}|
|about    |{page_list:['something']}       |
+---------+--------------------------------+

score 1 · Answer 1 · answered Jan 06 '20 at 16:09

Assuming that your DataFrame schema is like the following (here the page_list column is a string):

df.printSchema()
#root
# |-- page_name: string (nullable = true)
# |-- page_list: string (nullable = true)

You can use from_json to get the page_list as an array of strings. Then use array_contains to check if the page_name is in this list.

The trick is that you will have to use expr to pass a column value as a parameter to array_contains.

from pyspark.sql.types import StructType, StructField, ArrayType, StringType
from pyspark.sql.functions import expr, from_json

df.withColumn(
    "flag",
    from_json(
        "page_list", 
        schema=StructType([StructField("page_list", ArrayType(StringType()))])
    )["page_list"]
).withColumn(
    "flag",
    expr("array_contains(flag, page_name)")
).show(truncate=False)
#+---------+----------------------------------+-----+
#|page_name|page_list                         |flag |
#+---------+----------------------------------+-----+
#|home     |{"page_list":["home","something"]}|true |
#|about    |{"page_list":["something"]}       |false|
#+---------+----------------------------------+-----+

score 0 · Answer 2 · answered Jan 06 '20 at 14:37

Here's a way to do:

df2 = (df
      .rdd
      .map(lambda x: (x.page_name, x.page_list, x.page_name in x.page_list['page_list']))
      .toDF(["page_name", "page_list", "flag"])

df2.show()

+---------+--------------------+-----+
|page_name|           page_list| flag|
+---------+--------------------+-----+
|     home|[page_list -> [ho...| true|
|    about|[page_list -> [so...|false|
+---------+--------------------+-----+

blackbishop · Answer 3 · 2020-01-06T21:23:37.497

0

If the column page_list is of type string, you could simply use contains function like this:

quoted_page_name = concat(lit("'"), col("page_name"), lit("'"))    
df.withColumn("flag", col("page_list").contains(quoted_page_name)).show()

Gives:

+---------+----------------------------------+-----+
|page_name|page_list                         |flag |
+---------+----------------------------------+-----+
|home     |{page_list: ['home', 'something']}|true |
|about    |{page_list: ['something']}        |false|
+---------+----------------------------------+-----+

edited Jan 06 '20 at 21:23

answered Jan 06 '20 at 18:59

blackbishop

30,945
11
55
76

I would recommend against doing this - it would break if the `page_name` column appeared as a substring anywhere in `page_list`. For example: `{page_list: ['xxxhomeyyy']}` would mistakenly match for `home` – pault Jan 06 '20 at 20:39
@pault Yes I agree if it is used as it is. But one can simply quote the page name, i.e `'home'`...using `concat(lit("'"), col("page_name"), lit("'"))` and it should give correct result. – blackbishop Jan 06 '20 at 21:18

How can I query where column exists in another column?

3 Answers3