40

What is the equivalent of the SQL LIKE operator in PySpark? For example, I would like to do:

SELECT * FROM table WHERE column LIKE "*somestring*";

I am looking for something simple like this (but it is not working):

df.select('column').where(col('column').like("*s*")).show()
Babu
    This is Scala, but pySpark will be essentially identical to this answer: http://stackoverflow.com/questions/35759099/filter-spark-dataframe-on-string-contains – Jeff Oct 24 '16 at 14:43

10 Answers

65

You can use the where and col functions to do this. where filters rows based on a condition (here, whether the column matches '%string%'); col('col_name') references the column, and like applies the SQL LIKE pattern:

from pyspark.sql.functions import col

df.where(col('col1').like("%string%")).show()
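To see what the pattern actually means, here is a plain-Python sketch (an illustration of SQL LIKE semantics, not Spark's implementation): `%` matches any sequence of characters and `_` matches exactly one character.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Illustrative helper: translate a SQL LIKE pattern into an
    anchored Python regex. '%' becomes '.*', '_' becomes '.', and
    every other character is matched literally."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

# '%string%' matches any value that contains 'string' anywhere
assert re.match(like_to_regex("%string%"), "some string here")
# 'a_c' requires exactly one character between 'a' and 'c'
assert not re.match(like_to_regex("a_c"), "abbc")
```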
GilZ
braj
18

From Spark 2.0.0 onwards, the following also works, using a SQL expression string inside where:

df.select('column').where("column like '%s%'").show()

desaiankitb
12

Use the like operator.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

df.filter(df.column.like('%s%')).show()
Rahul
6

To replicate the case-insensitive ILIKE, you can use lower in conjunction with like.

from pyspark.sql.functions import col, lower

df.where(lower(col('col1')).like("%string%")).show()
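The same idea in plain Python (a sketch of the semantics, not Spark code): lower-casing both sides reduces ILIKE '%needle%' to a case-insensitive substring test. Note that recent Spark versions also ship a Column.ilike method that does this in one call.

```python
def ilike_contains(value: str, needle: str) -> bool:
    """Hypothetical helper mirroring lower(col('col1')).like('%needle%'):
    lower-case both operands, then do a plain substring test."""
    return needle.lower() in value.lower()

# matches regardless of case
assert ilike_contains("MyString", "STRING")
```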
yardsale8
5

The SQL LIKE wildcard is % rather than *, so the expression from the question works as:

df.select('column').where(col('column').like("%s%")).show()
Babu
3

Using Spark 2.4 onwards, to negate the match you can simply do:

df = df.filter("column not like '%bla%'")
YOLO
3

This worked for me:

import pyspark.sql.functions as f
df.where(f.col('column').like("%x%")).show()
StupidWolf
gauravJ
2

In PySpark you can always register the DataFrame as a temporary table and query it with SQL (note that the SQL LIKE wildcard is %, not *):

df.registerTempTable('my_table')
query = """SELECT * FROM my_table WHERE column LIKE '%somestring%'"""
sqlContext.sql(query).show()
sau
    In Spark 2.0 and newer use `createOrReplaceTempView` instead, registerTempTable is deprecated. – Davos Aug 26 '19 at 04:38
0

The contains method can also be used:

df = df.where(col("columnname").contains("somestring"))
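contains is a literal substring check: unlike like, it gives % and _ no special meaning. A plain-Python sketch of the difference (illustrative only, not Spark's implementation):

```python
def contains(value: str, needle: str) -> bool:
    """Mirrors Column.contains: a literal substring test, no wildcards."""
    return needle in value

# '%' is matched literally here, whereas like('%...') would treat it
# as a wildcard
assert contains("100% done", "% done")
assert not contains("100 percent", "%")
```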

user11222393
-3

I always use a UDF to implement such functionality:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Python UDF returning whether the value contains 's'
like_f = F.udf(lambda col: 's' in col, BooleanType())
df.filter(like_f('column')).select('column')
Allen211
    While functional, using a python UDF will be slower than using the column function `like(...)`. The reason for this is using a pyspark UDF requires that the data get converted between the JVM and Python. Furthermore, the dataframe engine can't optimize a plan with a pyspark UDF as well as it can with its built in functions. – kamprath Jun 04 '17 at 03:11