41

I am using Spark 1.3.0 and Spark Avro 1.0.0, working from the example on the repository page. The following code works well:

val df = sqlContext.read.avro("src/test/resources/episodes.avro")
df.filter("doctor > 5").write.avro("/tmp/output")

But what if I need to check whether the doctor string contains a substring? Since we are writing our expression inside a string, how do I express a "contains"?

merv
Knows Not Much

2 Answers

97

You can use `contains` (this matches on an arbitrary substring):

df.filter($"foo".contains("bar"))

`like` (SQL LIKE with the SQL simple pattern language, with `_` matching an arbitrary character and `%` matching an arbitrary sequence):

df.filter($"foo".like("bar"))

or `rlike` (like `like`, but with Java regular expressions):

df.filter($"foo".rlike("bar"))

depending on your requirements. `LIKE` and `RLIKE` should work with SQL expression strings as well.
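
A minimal, self-contained sketch of all three variants, plus the SQL-expression-string form (the SparkSession setup, the sample rows, and the column name `title` are illustrative assumptions, not from the question):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("contains-demo").master("local[*]").getOrCreate()
import spark.implicits._

// Made-up sample data standing in for the episodes DataFrame.
val df = Seq("The Doctor Dances", "Blink", "Doctor Who").toDF("title")

df.filter($"title".contains("Doctor")).show()   // plain substring match
df.filter($"title".like("%Doctor%")).show()     // SQL pattern: % = any sequence, _ = any single character
df.filter($"title".rlike("^The ")).show()       // Java regex; matches anywhere unless anchored

// The same predicates written as SQL expression strings:
df.filter("title LIKE '%Doctor%'").show()
df.filter("title RLIKE '^The '").show()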

zero323
  • Is the above Scala code? It looks like Scala doesn't like the `$` sign. I imported `import org.apache.spark.sql.functions.lit` – Knows Not Much Mar 02 '16 at 22:27
  • Scala. To make `$` work you'll need to `import sqlContext.implicits._`. You can replace it with `df("foo")` or `org.apache.spark.sql.functions.col("foo")` as well. – zero323 Mar 02 '16 at 22:31
  • Is there a similar function in pyspark? – oluies Aug 19 '16 at 13:51
  • @oluies You can use any of these ([`like`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.like), [`rlike`](https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.Column.rlike) directly, `contains` by calling the JVM method) on a `Column` object. – zero323 Aug 19 '16 at 13:52
  • @zero323 I get an error executing this code: `filt_df = df.filter((col('BARCODE_ID').contains('YH','YN','YP','YT','YW','YY','3A','3G','3H')) & (df.UOW_PKG_DELIV_MODE_TYPE_DESC.substr(0,17) != 'SUREPOST REDIRECT'))` → Column not callable. Any thoughts? – E B Apr 05 '18 at 21:16
  • @zero323 Hey, can I use it for a multiple-contains condition, something like `contains("bar" or "foo")`? – Abu Shoeb May 18 '18 at 20:06
  • What if I want to search for a list of strings (containing 10K tokens) in a single string? Regex won't be efficient. Is there another way to do that in Spark? – Reihan_amn Sep 04 '18 at 18:55
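
Following up on the comments above, a short sketch of the interchangeable column references and one way to combine several `contains` predicates (the DataFrame and values are made up for illustration):

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._   // enables the $"..." syntax (sqlContext.implicits._ on Spark 1.x)

val df = Seq("bar one", "baz two", "qux").toDF("foo")

df.filter($"foo".contains("bar")).show()      // $-interpolator, needs the implicits import
df.filter(df("foo").contains("bar")).show()   // no implicits required
df.filter(col("foo").contains("bar")).show()  // explicit col()

// Re: matching several substrings — combine Column predicates with || :
df.filter($"foo".contains("bar") || $"foo".contains("baz")).show()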
1

In PySpark, the Spark SQL syntax

where column_n like 'xyz%'

might not work.

Use:

where column_n RLIKE '^xyz' 

This works perfectly fine.
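
A minimal sketch of the RLIKE form through `spark.sql` (Scala shown to match the rest of the thread; the SQL string is the same in PySpark, and the table name and sample rows are made up):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq("xyz123", "abcxyz").toDF("column_n")
df.createOrReplaceTempView("t")

// '^xyz' anchors the regex at the start of the string, the role 'xyz%' plays in LIKE.
spark.sql("SELECT * FROM t WHERE column_n RLIKE '^xyz'").show()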

vikrant rana
Sam91