pickle.PicklingError: args[0] from __newobj__ args has the wrong class with hadoop python

Question

I am trying to delete stop words via Spark. The code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]

wordlist=spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words=[]
    print word_list
    for word in word_list:
        print word
        if word not in stopwords.words('english'):
            filtered_words.append(word)

filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)

and I get the following error:

pickle.PicklingError: args[0] from __newobj__ args has the wrong class

I don't know why. Can somebody help me?
Thanks in advance.

Hi, I am facing the same issue while using spark. Waiting for the solution. — Ravi Ranjan, Jul 12 '17 at 07:16

Shankar · Answer 1 · 2021-07-21T05:13:03.057

It has to do with uploading the stopwords module. As a workaround, import the stopwords library within the function itself. Please see the similar issue linked below. I had the same issue and this workaround fixed the problem.

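    def stopwords_delete(word_list):
        # Import inside the function so nltk is loaded on the executor
        # instead of being pickled from the driver.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words('english'))
        return [word for word in word_list if word not in stop_words]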

Similar Issue

As a permanent fix, I would recommend using StopWordsRemover (from pyspark.ml.feature import StopWordsRemover).
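A minimal sketch of the StopWordsRemover approach, assuming an existing SparkSession is passed in. The helper name remove_stop_words and the column names "words" and "filtered" are illustrative and not from the original post. Spark's built-in English list is not identical to NLTK's, so the result may differ slightly from the NLTK-based output.

```python
def remove_stop_words(spark, word_list):
    """Filter stop words with Spark's built-in transformer instead of NLTK.

    `spark` is an existing SparkSession; the helper name and the column
    names are illustrative assumptions.
    """
    # Imported here so that merely defining the helper does not require pyspark.
    from pyspark.ml.feature import StopWordsRemover

    # StopWordsRemover expects a column holding an array of strings.
    df = spark.createDataFrame([(word_list,)], ["words"])
    remover = StopWordsRemover(inputCol="words", outputCol="filtered")
    # Without an explicit stopWords argument, the default English list is used.
    return remover.transform(df).select("filtered").first()[0]
```

With the session and list from the question, this could be called as remove_stop_words(spark, word_list).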

score 3 · Answer 2 · answered Nov 02 '18 at 18:56

Probably it's just because you are evaluating stopwords.words('english') every time on the executor. Define it outside the function and this should work.
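One way to read this suggestion, as a sketch: evaluate the stop word list once on the driver, keep it as a plain set, and let the function shipped to the executors reference only that set. The name make_stop_word_filter and the short sample_stop_words list are hypothetical stand-ins so the snippet runs without NLTK or Spark. In the question's setting the set would come from stopwords.words('english').

```python
def make_stop_word_filter(stop_words):
    """Return a predicate that is True for words that are not stop words."""
    # A frozenset is plain data, so the closure carries no nltk object with it.
    stop_set = frozenset(stop_words)

    def keep(word):
        return word not in stop_set

    return keep


# Hypothetical stand-in list; on the driver this would be
# stopwords.words('english'), evaluated once before calling map/filter.
sample_stop_words = ["out", "over", "own", "she", "the", "is", "this"]
keep = make_stop_word_filter(sample_stop_words)

words = ["out", "shan't", "the", "world", "is", "who's"]
filtered = [word for word in words if keep(word)]
print(filtered)

# With the RDD from the question, the same predicate could be applied as:
#   wordlist.flatMap(lambda row: row).filter(keep)
# which keeps the result as an RDD rather than a list.
```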

Abhishek Gupta

It helps others if you provide example code demonstrating that this is the correct answer. – anothermh Nov 02 '18 at 22:02

Suresh · Answer 3 · 2017-07-13T13:06:47.697

You are using map over an RDD that has only one row, with each word as a column. So the entire row, which is of type Row, is passed to the stopwords_delete function, and the for loop within it tries to match the RDD row against the stopwords, which fails. Try it like this:

filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)

I got this output for filtered_words:

["shan't", "she'd", 'fuck', 'world', "who's"]

Also, include a return in your function.

Alternatively, you could use a list comprehension to replace the stopwords_delete function:

filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()

edited Jul 13 '17 at 13:06

answered Jul 13 '17 at 12:47

Hi Suresh, thank you for your answer. filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect() still gives me the same error. The first one works, but the return type is not an RDD. Can you help me? – Tiana Jul 17 '17 at 14:34
The return type will be a list, since we create a list in the function and append the words to it. Do you need it to be an RDD? – Suresh Jul 17 '17 at 15:29

score 0 · Answer 4 · answered Apr 27 '22 at 11:50

The problem is related to the stopwords.words('english') line; you need to define it in a stable way.

Zak_Stack
