I would like to search a text column in a PySpark DataFrame for phrases. Here is an example to show what I mean.
sentenceData = spark.createDataFrame([
    (0, "Hi I heard about Spark"),
    (4, "I wish Java could use case classes"),
    (11, "Logistic regression models are neat")],
    ["id", "sentence"])
If a sentence contains "heard about spark", then categorySpark=1 and categoryHeard=1.
If a sentence contains "java" or "regression", then categoryCool=1.
I have about 28 such boolean categories to check for (maybe regex patterns would work better than literal phrases).
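For reference, the full set of checks would look roughly like this; the category names and patterns below are just placeholders for my real ~28 entries:

from pyspark.sql import functions as F

# Placeholder mapping: output column name -> regex pattern (the real dict has ~28 entries)
category_patterns = {
    "categorySpark": "(?i)heard about spark",
    "categoryHeard": "(?i)heard about spark",
    "categoryCool": "(?i)java|regression",
}

flagged = sentenceData
for col_name, pattern in category_patterns.items():
    # rlike returns a boolean; cast to int to get 0/1 flags
    flagged = flagged.withColumn(col_name, F.col("sentence").rlike(pattern).cast("int"))

For a single category I can already do this: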
sentenceData.withColumn('categoryCool', sentenceData['sentence'].rlike('Java|regression')).show()
returns:
+---+--------------------+------------+
| id| sentence|categoryCool|
+---+--------------------+------------+
| 0|Hi I heard about ...| false|
| 4|I wish Java could...| true|
| 11|Logistic regressi...| true|
+---+--------------------+------------+
This is what I want, but I'd like to add it to a pipeline as a transformation step.
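What I'm imagining is something like a custom Transformer that wraps the loop above so it can sit alongside the other stages. This is only an untested sketch (the class name PhraseFlagger is made up, and it reuses the category_patterns dict from earlier), not something I have working:

from pyspark.ml import Pipeline, Transformer
from pyspark.sql import functions as F

class PhraseFlagger(Transformer):
    """Adds one 0/1 column per regex pattern (hypothetical sketch)."""
    def __init__(self, inputCol, patterns):
        super(PhraseFlagger, self).__init__()
        self.inputCol = inputCol   # name of the text column to search
        self.patterns = patterns   # dict: output column name -> regex pattern

    def _transform(self, dataset):
        for col_name, pattern in self.patterns.items():
            dataset = dataset.withColumn(
                col_name, F.col(self.inputCol).rlike(pattern).cast("int"))
        return dataset

pipeline = Pipeline(stages=[PhraseFlagger("sentence", category_patterns)])
pipeline.fit(sentenceData).transform(sentenceData).show()

Is a custom Transformer like this the right approach, or is there a built-in way to do this kind of phrase flagging as a pipeline stage?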