40

What is the equivalent of the SQL LIKE operator in PySpark? For example, I would like to do:

SELECT * FROM table WHERE column LIKE "*somestring*";

I am looking for something simple like this (but it is not working):

df.select('column').where(col('column').like("*s*")).show()
Babu
    This is Scala, but pySpark will be essentially identical to this answer: http://stackoverflow.com/questions/35759099/filter-spark-dataframe-on-string-contains – Jeff Oct 24 '16 at 14:43

10 Answers

65

You can use the where and col functions to do this. where filters rows based on a condition (here, whether the column matches '%string%'); col('col_name') references the column, and like applies the SQL LIKE pattern:

from pyspark.sql.functions import col

df.where(col('col1').like("%string%")).show()
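To see what the pattern actually means, here is a plain-Python sketch (an illustration of SQL LIKE semantics, not Spark's implementation): `%` matches any sequence of characters and `_` matches exactly one character.

```python
import re

def like_to_regex(pattern: str) -> str:
    """Illustrative helper: translate a SQL LIKE pattern into an
    anchored Python regex. '%' becomes '.*', '_' becomes '.', and
    every other character is matched literally."""
    out = []
    for ch in pattern:
        if ch == "%":
            out.append(".*")
        elif ch == "_":
            out.append(".")
        else:
            out.append(re.escape(ch))
    return "^" + "".join(out) + "$"

# '%string%' matches any value that contains 'string' anywhere
assert re.match(like_to_regex("%string%"), "some string here")
# 'a_c' requires exactly one character between 'a' and 'c'
assert not re.match(like_to_regex("a_c"), "abbc")
```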
GilZ
braj
18

From Spark 2.0.0 onwards, the following also works, using a SQL expression string inside where:

df.select('column').where("column like '%s%'").show()

desaiankitb
12

Use the like operator.

https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions

df.filter(df.column.like('%s%')).show()
Rahul
6

To replicate the case-insensitive ILIKE, you can use lower in conjunction with like.

from pyspark.sql.functions import col, lower

df.where(lower(col('col1')).like("%string%")).show()
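The same idea in plain Python (a sketch of the semantics, not Spark code): lower-casing both sides reduces ILIKE '%needle%' to a case-insensitive substring test. Note that recent Spark versions also ship a Column.ilike method that does this in one call.

```python
def ilike_contains(value: str, needle: str) -> bool:
    """Hypothetical helper mirroring lower(col('col1')).like('%needle%'):
    lower-case both operands, then do a plain substring test."""
    return needle.lower() in value.lower()

# matches regardless of case
assert ilike_contains("MyString", "STRING")
```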
yardsale8
5

The SQL LIKE wildcard is % rather than *, so the expression from the question works as:

df.select('column').where(col('column').like("%s%")).show()
Babu
3

Using Spark 2.4 onwards, to negate the match you can simply do:

df = df.filter("column not like '%bla%'")
YOLO
3

This worked for me:

import pyspark.sql.functions as f
df.where(f.col('column').like("%x%")).show()
StupidWolf
gauravJ
2

In PySpark you can always register the DataFrame as a temporary table and query it with SQL (note that the SQL LIKE wildcard is %, not *):

df.registerTempTable('my_table')
query = """SELECT * FROM my_table WHERE column LIKE '%somestring%'"""
sqlContext.sql(query).show()
sau
    In Spark 2.0 and newer use `createOrReplaceTempView` instead, registerTempTable is deprecated. – Davos Aug 26 '19 at 04:38
0

The contains method can also be used:

df = df.where(col("columnname").contains("somestring"))
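contains is a literal substring check: unlike like, it gives % and _ no special meaning. A plain-Python sketch of the difference (illustrative only, not Spark's implementation):

```python
def contains(value: str, needle: str) -> bool:
    """Mirrors Column.contains: a literal substring test, no wildcards."""
    return needle in value

# '%' is matched literally here, whereas like('%...') would treat it
# as a wildcard
assert contains("100% done", "% done")
assert not contains("100 percent", "%")
```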

user11222393
-3

I always use a UDF to implement such functionality:

from pyspark.sql import functions as F
from pyspark.sql.types import BooleanType

# Python UDF returning whether the value contains 's'
like_f = F.udf(lambda col: 's' in col, BooleanType())
df.filter(like_f('column')).select('column')
Allen211
    While functional, using a python UDF will be slower than using the column function `like(...)`. The reason for this is using a pyspark UDF requires that the data get converted between the JVM and Python. Furthermore, the dataframe engine can't optimize a plan with a pyspark UDF as well as it can with its built in functions. – kamprath Jun 04 '17 at 03:11