8

I am trying to remove stop words with Spark. The code is as follows:

from nltk.corpus import stopwords
from pyspark.context import SparkContext
from pyspark.sql.session import SparkSession

sc = SparkContext('local')
spark = SparkSession(sc)
word_list=["ourselves","out","over", "own", "same" ,"shan't" ,"she", "she'd", "what", "the", "fuck", "is", "this","world","too","who","who's","whom","yours","yourself","yourselves"]

wordlist=spark.createDataFrame([word_list]).rdd

def stopwords_delete(word_list):
    filtered_words=[]
    print word_list

    for word in word_list:
        print word
        if word not in stopwords.words('english'):
            filtered_words.append(word)

filtered_words=wordlist.map(stopwords_delete)
print(filtered_words)

and I get the following error:

pickle.PicklingError: args[0] from newobj args has the wrong class

I don't know why. Can somebody help me?
Thanks in advance.

Rayhane Mama
Tiana

4 Answers

6

This has to do with how the stopwords module is shipped to the workers. As a workaround, import stopwords inside the function itself. See the similar issue linked below. I had the same problem, and this workaround fixed it.

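    def stopwords_delete(word_list):
        # Import inside the function so the name is resolved on the worker
        # instead of being pickled along with the function.
        from nltk.corpus import stopwords
        stop_words = set(stopwords.words('english'))
        return [word for word in word_list if word not in stop_words]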

Similar Issue

As a permanent fix, I would recommend StopWordsRemover from pyspark.ml.feature.

Shankar
3

This is probably because stopwords.words('english') is evaluated on the executor for every word. Define the list once, outside the function, and it should work.
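One way to read "define it outside" is sketched below: build the stop-word collection once on the driver as a plain frozenset and pass it to a small filter function. The helper name `remove_stop_words` and the tiny hand-written stop-word set are assumptions made so the sketch runs without NLTK or Spark installed.

```python
def remove_stop_words(words, stop_words):
    """Return the words that are not in stop_words, preserving order."""
    return [word for word in words if word not in stop_words]


# Build the collection once. With NLTK this would be:
#     from nltk.corpus import stopwords
#     stop_words = frozenset(stopwords.words('english'))
# A hand-written stand-in (an assumption) keeps the sketch free of downloads.
stop_words = frozenset(["out", "over", "she", "what", "the", "is", "this"])

words = ["what", "the", "world", "is", "this"]
filtered = remove_stop_words(words, stop_words)
print(filtered)

# With Spark, the lambda would capture only the plain frozenset of strings:
#     filtered_rdd = wordlist.map(lambda row: remove_stop_words(row, stop_words))
```

A set also makes each membership check a hash lookup instead of a scan through a list that is rebuilt for every word.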

1

You are calling map over an RDD that has only one row, with each word as a column. So the entire row, which is of type Row, is passed to the stopwords_delete function, and the for loop inside it tries to match that row against the stop words, which fails. Try this instead:

filtered_words=stopwords_delete(wordlist.flatMap(lambda x:x).collect())
print(filtered_words)

This gave me the following filtered_words:

["shan't", "she'd", 'fuck', 'world', "who's"]

Also, include a return in your function.

Alternatively, you could use a list comprehension in place of the stopwords_delete function:

filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect()
Suresh
  • Hi Suresh, thank you for your answer. filtered_words = wordlist.flatMap(lambda x:[i for i in x if i not in stopwords.words('english')]).collect() still gives me the same error. The first approach works, but the return type is not an RDD. Can you help me? – Tiana Jul 17 '17 at 14:34
  • The return type will be a list; the function creates a list and appends the words to it. Do you need it to be an RDD? – Suresh Jul 17 '17 at 15:29
0

The problem is related to the stopwords.words('english') line. You need to define that list in a stable way.

Zak_Stack