I'm trying to apply POS tagging to one of my tokenized columns, called "removed", in a PySpark DataFrame.

I'm trying with

nltk.pos_tag(df_removed.select("removed"))

But all I get is: ValueError: Cannot apply 'in' operator against a column: please use 'contains' in a string column or 'array_contains' function for an array column.

How can I make it work?

1 Answer

It seems the answer is in the error message: the input of pos_tag should be a string (or a list of tokens), but you are passing it a whole column. You should apply pos_tag to each row of your column, for example with withColumn. A plain Python function cannot be applied to a Spark column directly, so you first need to wrap pos_tag in a udf.

For example, you can start by writing:

from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

pos_tag_udf = udf(nltk.pos_tag, ArrayType(ArrayType(StringType())))
my_new_df = df_removed.withColumn("removed", pos_tag_udf(df_removed.removed))
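
Here ArrayType(ArrayType(StringType())) is one possible return schema, assuming removed holds arrays of string tokens: pos_tag returns a list of (token, tag) pairs, which Spark can serialize as two-element arrays. An ArrayType of a two-field StructType would work just as well.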

You can also do:

my_new_df = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x[0])).toDF()
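
Note that .rdd converts the DataFrame into an RDD of Row objects, so each x in the lambda is a Row, not the token list itself; x[0] extracts the token list from the Row before it is passed to pos_tag.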

Here is the documentation.
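
For reference, a minimal self-contained sketch of the whole flow, assuming the column removed holds arrays of string tokens and that the NLTK tagger model (averaged_perceptron_tagger) is installed and available on every Spark worker; the toy data here is only an illustration:

import nltk
from pyspark.sql import SparkSession
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, StringType

spark = SparkSession.builder.getOrCreate()

# Toy DataFrame with a tokenized (array<string>) column like the question's "removed"
df_removed = spark.createDataFrame(
    [(["this", "is", "a", "test"],), (["another", "row"],)],
    ["removed"],
)

# Wrap pos_tag in a UDF; each (token, tag) pair comes back as a two-element array.
# Requires nltk.download('averaged_perceptron_tagger') on each worker.
pos_tag_udf = udf(nltk.pos_tag, ArrayType(ArrayType(StringType())))

df_tagged = df_removed.withColumn("tagged", pos_tag_udf("removed"))
df_tagged.show(truncate=False)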

  • Thanks, but I get an error: `'DataFrame' object has no attribute 'map'` with `df_removed.select("removed").map(nltk.pos_tag)` – milva Mar 30 '20 at 10:45
  • Could you please tell me what `rdd` changes in that code? I'm new to pyspark and I would like to understand it. – milva Apr 06 '20 at 08:46
  • And unfortunately this code does not work for me :( I got a Py4JJavaError – milva Apr 06 '20 at 08:47
  • RDD is a class which allows you to perform operations on distributed data. What is your error exactly? – Catalina Chircu Apr 06 '20 at 09:24
  • Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob. : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 18.0 failed 1 times, most recent failure: Lost task 0.0 in stage 18.0 (TID 483, localhost, executor driver): org.apache.spark.SparkException: Python worker failed to connect back. – milva Apr 06 '20 at 09:59
  • I see it is a file-name problem - `OSError: [WinError 123]` – milva Apr 06 '20 at 10:24
  • Here : https://stackoverflow.com/questions/21115580/python-os-rename-oserror-winerror-123 – Catalina Chircu Apr 06 '20 at 10:52
  • Unfortunately this case does not resolve mine - I don't know why my spark reads such a line: `'C:\\C:\\Users\\Olga\\Desktop\\Spark\\spark-2.4.5-bin-hadoop2.7\\jars\\spark-core_2.11-2.4.5.jar'` So it gets two "C:\\" – milva Apr 06 '20 at 11:09
  • Just add all your code. I see you are on Windows. – Catalina Chircu Apr 06 '20 at 11:13
  • I asked about it here https://stackoverflow.com/questions/61059445/how-to-fix-pyspark-oserror-winerror-123 – milva Apr 06 '20 at 12:11
  • I updated the answer with a function more appropriate for string parsing. – Catalina Chircu Apr 08 '20 at 06:31
  • What's (x) in `my_new_df = df_removed.withColumn("removed", nltk.pos_tag(x))`? – milva Apr 08 '20 at 08:22
  • Sorry, I missed that. Corrected. Check also the documentation I posted. – Catalina Chircu Apr 08 '20 at 10:25
  • Hi, I've tried with `my_new_df = df_removed.select("removed").rdd.map(lambda x: nltk.pos_tag(x)).toDF()` but I got: `AttributeError: 'list' object has no attribute 'isdigit'` – milva Apr 09 '20 at 13:43
  • Can you display the column `removed` of your DataFrame? I guess it must be a list. You need a string. – Catalina Chircu Apr 09 '20 at 14:04
  • Now I found it - I just needed to first take `x[0]` and then apply pos_tag. Thank you for the help!!!! Your answer really helped me :) – milva Apr 10 '20 at 16:48
  • I'm glad I could help! I couldn't have found the error myself, as I did not know the format of your input. – Catalina Chircu Apr 10 '20 at 17:24