I have a Spark DataFrame that contains text data, and I am trying to clean the HTML markup out of it using the Python BeautifulSoup library.
When I use BeautifulSoup with Spark installed locally on my Mac laptop, a Spark UDF works fine and strips the markup:
from bs4 import BeautifulSoup
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def html_parsing(x):
    """Cleans the HTML from the DataFrame text column."""
    textcleaned = ''
    souptext = BeautifulSoup(x)
    # Keep only the plain text inside <p> tags
    for p in souptext.find_all('p'):
        if p.string:
            textcleaned += p.string
    return textcleaned

parse_html = udf(html_parsing, StringType())

sdf_cleaned = sdf_rss.dropna(subset=['desc']) \
                     .withColumn('text_cleaned', parse_html('desc')) \
                     .select('id', 'title', 'text_cleaned')

sdf_cleaned.cache().take(3)
[Row(id=u'-33753621', title=u'Royal Bank of Scotland is testing a robot that could solve your banking problems (RBS)', text_cleaned=u"If you hate dealing with bank tellers or customer service representatives, then the Royal Bank of Scotland might have a solution for you.If this program is successful, it could be a big step forward on the road to automated customer service through the use of AI, notes Laurie Beaver, research associate for BI Intelligence, Business Insider's premium research service.It's noteworthy that Luvo does not operate via a third-party app such as Facebook Messenger, WeChat, or Kik, all of which are currently trying to create bots that would assist in customer service within their respective platforms.Luvo would be available through the web and through smartphones. It would also use machine learning to learn from its mistakes, which should ultimately help with its response accuracy.Down the road, Luvo would become a supplement to the human staff. It can currently answer 20 set questions but as that number grows, it would allow the human employees to more complicated issues. If a problem is beyond Luvo's comprehension, then it would refer the customer to a bank employee; however,\xa0a user could choose to speak with a human instead of Luvo anyway.AI such as Luvo, if successful, could help businesses become more efficient and increase their productivity, while simultaneously improving customer service capacity, which would consequently\xa0save money that would otherwise go toward manpower.And this trend is already starting. Google, Microsoft, and IBM are investing significantly into AI research. Furthermore, the global AI market is estimated to grow from approximately $420 million in 2014 to $5.05 billion in 2020, according to a forecast by Research and Markets.\xa0The move toward AI would be just one more way in which the digital age is disrupting retail banking. Customers, particularly millennials, are increasingly moving toward digital banking, and as a result, they're walking into their banks' traditional brick-and-mortar branches less often than ever before."),
However, when I run the same code on Spark installed on a cluster, it fails with "No module named bs4". I ran the code above in an Anaconda Jupyter notebook with the PySpark kernel, installed on the cluster:
Py4JJavaError: An error occurred while calling z:org.apache.spark.sql.execution.EvaluatePython.takeAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 3.0 failed 4 times, most recent failure: Lost task 0.3 in stage 3.0 (TID 9, 107-45-c02.sc): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/worker.py", line 98, in main
command = pickleSer._read_with_length(infile)
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 164, in _read_with_length
return self.loads(obj)
File "/var/storage/nm-sda3/nm-local/usercache/appcache/application_1485803993783_0153/container_1485803993783_0153_01_000002/pyspark.zip/pyspark/serializers.py", line 422, in loads
return pickle.loads(obj)
ImportError: No module named bs4
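As far as I understand, the UDF body is pickled on the driver and unpickled inside the executors' Python processes, so the bs4 import has to succeed on every worker, not just on the driver. A minimal way to reproduce the failure without any of the DataFrame logic (just a sketch, assuming a live SparkContext named sc) would be:

def try_import(_):
    # This import runs on the executor, not the driver; if the workers'
    # Python has no bs4, this raises the same ImportError as the UDF.
    import bs4
    return bs4.__version__

sc.parallelize(range(2), 2).map(try_import).collect()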
I want to highlight that the Anaconda installation on the Spark cluster also has BeautifulSoup; I confirmed it by running
conda list
which shows the package there.
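That said, I am not sure the executors even use this Anaconda Python rather than some system Python without the package. Something like the following sketch (again assuming a live SparkContext sc) should show which interpreter the driver and the workers actually run:

import sys

# Interpreter used by the driver
print(sys.executable)

# Interpreters used by the executors (may differ from the driver's on a YARN cluster)
print(sc.parallelize(range(2), 2)
        .map(lambda _: __import__('sys').executable)
        .distinct()
        .collect())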
So what might be the issue here that I am missing?
Thanks a lot for any help.