I have a Spark DataFrame that contains text data, and I am trying to clean the HTML markup out of it using the Python BeautifulSoup library.
When I use BeautifulSoup with Spark installed locally on my Mac laptop, a Spark UDF works fine and strips the markup:
from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def html_parsing(x):
    """Cleans the HTML from the DataFrame text column."""
    textcleaned = ''
    souptext = BeautifulSoup(x)
    # Keep only the plain text inside <p> tags
    for p in souptext.find_all('p'):
        if p.string:
            textcleaned += p.string
    return textcleaned

parse_html = udf(html_parsing, StringType())

sdf_cleaned = sdf_rss.dropna(subset=['desc']) \
                     .withColumn('text_cleaned', parse_html('desc')) \
                     .select('id', 'title', 'text_cleaned')

sdf_cleaned.cache().take(3)
[Row(id=u'-33753621', title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', text_cleaned=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
However, when I run the same code on Spark installed on a cluster, it fails with "No module named bs4". I ran the code above in an Anaconda Jupyter notebook with the PySpark kernel, installed on the cluster:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9, 107-45-c02.sc): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named bs4
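As far as I understand, the UDF body is pickled on the driver and unpickled inside the executors' Python processes, so the bs4 import has to succeed on every worker, not just on the driver. A minimal way to reproduce the failure without any of the DataFrame logic (just a sketch, assuming a live SparkContext named sc) would be:

def try_import(_):
    # This import runs on the executor, not the driver; if the workers'
    # Python has no bs4, this raises the same ImportError as the UDF.
    import bs4
    return bs4.__version__

sc.parallelize(range(2), 2).map(try_import).collect()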
I want to highlight that the Anaconda installation on the Spark cluster also has BeautifulSoup; I confirmed it by running
conda list
which shows the package there.
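That said, I am not sure the executors even use this Anaconda Python rather than some system Python without the package. Something like the following sketch (again assuming a live SparkContext sc) should show which interpreter the driver and the workers actually run:

import sys

# Interpreter used by the driver
print(sys.executable)

# Interpreters used by the executors (may differ from the driver's on a YARN cluster)
print(sc.parallelize(range(2), 2)
        .map(lambda _: __import__('sys').executable)
        .distinct()
        .collect())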
So what might be the issue here that I am missing?
Thanks a lot for any help.