I have GBs of data in S3 and am trying to read it in parallel in my code by referring to the following link.
I am using the code below as a sample, but when I run it, it fails with the following error:
Any help on this is deeply appreciated, as I am very new to Spark.
EDIT: I have to read my S3 files in parallel, which is not explained in any post. People marking this as a duplicate, please read the problem first.
PicklingError: Could not serialize object: Exception: It appears that you are attempting to reference SparkContext from a broadcast variable, action, or transformation. SparkContext can only be used on the driver, not in code that it run on workers. For more information, see SPARK-5063.
class telco_cn:
    def __init__(self, sc):
        self.sc = sc

    def decode_module(msg):
        df = spark.read.json(msg)
        return df

    def consumer_input(self, sc, k_topic):
        a = sc.parallelize(['s3://bucket1/1575158401-51e09537-0ce5-c775-6beb-fd1b0a568e15.json'])
        d = a.map(lambda x: telco_cn.decode_module(x)).collect()
        print(d)

if __name__ == "__main__":
    cn = telco_cn(sc)
    cn.consumer_input(sc, '')