I have a pyspark dataframe like this:
+--------------------+--------------------+
| label| sentences|
+--------------------+--------------------+
|[things, we, eati...|<p>I am construct...|
|[elephants, nordi...|<p><strong>Edited...|
|[bee, cross-entro...|<p>I have a data ...|
|[milking, markers...|<p>There is an Ma...|
|[elephants, tease...|<p>I have Score d...|
|[references, gene...|<p>I'm looking fo...|
|[machines, exitin...|<p>I applied SVM ...|
+--------------------+--------------------+
And a `top_ten` list like this:

['bee', 'references', 'milking', 'expert', 'bombardier', 'borscht', 'distributions', 'wires', 'keyboard', 'correlation']

And I need to create a `new_label` column indicating `1.0` if at least one of the `label` values for that row exists in the `top_ten` list.
While the logic makes sense, my inexperience with the syntax is showing. Surely there's a short-ish answer to this problem?
I've tried:
temp = train_df.withColumn('label', F.when(lambda x: x.isin(top_ten), 1.0).otherwise(0.0))
and this:
def matching_top_ten(top_ten, labels):
    for label in labels:
        if label.isin(top_ten):
            return 1.0
        else:
            return 0.0
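(I realize now that plain Python strings don't have an `.isin` method, and that the early `return 0.0` would bail out on the first non-matching label anyway. The pure-Python check I'm actually after is something like this, using toy inputs:)

```python
top_ten = ['bee', 'references', 'milking', 'expert', 'bombardier',
           'borscht', 'distributions', 'wires', 'keyboard', 'correlation']

def matching_top_ten(labels):
    # 1.0 if any label appears in top_ten, else 0.0.
    # `in` is the plain-Python membership test (no .isin on strings).
    return 1.0 if any(label in top_ten for label in labels) else 0.0

matching_top_ten(['bee', 'cross-entropy'])   # 1.0
matching_top_ten(['machines', 'exiting'])    # 0.0
```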
I found out after this last attempt that these functions can't be mapped to a dataframe. So I guess I could convert the column to an RDD, map it, and then .join()
it back, but that sounds unnecessarily tedious.
**Update:** Tried the above function as a UDF with no luck as well:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType
matching_udf = udf(matching_top_ten, FloatType())
temp = train_df.select('label', matching_udf(top_ten, 'label').alias('new_labels'))
----
TypeError: Invalid argument, not a string or column: [...top_ten list values...] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
There are other similar questions I've found on SO, but none of them involve the logic of checking a list against another list (at best, a single value against a list).