I am using PySpark 3.1.2 with JDK 8 on Google Colab. I get an error when I try to convert a text file into a tuple-based data format. Here is my code:
# read the data into 4 partitions
captain_odi = sc.textFile("/content/drive/MyDrive/PYspark/ODI data.csv",4,use_unicode=False)
captain_odi.take(10)
The output is:
[b',Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,Unnamed: 13',
b'0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20,',
b'1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,',
b'2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,',
b'3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,']
Now, when I build namedtuples from the same data using the code below:
fields = ["name","country","career","matches","won","loss","ties","toss"]
from collections import namedtuple
Captain = namedtuple("Captain", fields)

def ParseReader(line):
    fields = line.split(",")
    return Captain(fields[0], fields[1], fields[2], fields[3], fields[4], fields[5], fields[6], fields[7])
captains = captain_odi.map(lambda x : ParseReader(x))
captains.take(10)
I get the following error when performing the .take() operation:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
<ipython-input-15-01599f2d7aa3> in <module>()
----> 1 captains.take(10)
/content/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
326 raise Py4JJavaError(
327 "An error occurred while calling {0}{1}{2}.\n".
--> 328 format(target_id, ".", name), value)
329 else:
330 raise Py4JError(
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 7) (786e4ea62989 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
process()
File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
serializer.dump_stream(out_iter, outfile)
File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
vs = list(itertools.islice(iterator, batch))
File "/content/spark-3.1.2-bin-hadoop3.2/python/pyspark/rdd.py", line 1560, in takeUpToNumLeft
yield next(iterator)
File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
return f(*args, **kwargs)
File "<ipython-input-8-f7a02b0b1d17>", line 10, in <lambda>
File "<ipython-input-8-f7a02b0b1d17>", line 8, in ParseReader
TypeError: a bytes-like object is required, not 'str'
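I suspect this is related to the records being bytes rather than str (because of use_unicode=False), since I can reproduce the same TypeError in plain Python without Spark (a minimal sketch using one record from my output above):

```python
# One raw record, as returned by textFile(..., use_unicode=False): a bytes object
line = b'0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20,'

# Splitting a bytes object with a str separator raises the same TypeError
try:
    line.split(",")
except TypeError as e:
    print(e)  # a bytes-like object is required, not 'str'
```

But I am not sure whether this is really the cause, or how I should handle it inside my ParseReader function.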
Can someone help me find out what is causing this error?