
I am using PySpark on Google Colab (v3.1.2 with JDK 8). I am facing an error when I try to convert a text file into a tuple-based data format. Here is my code.

#reading data in 4 partitions and repartitioning the data into 6
captain_odi  = sc.textFile("/content/drive/MyDrive/PYspark/ODI data.csv",4,use_unicode=False)
captain_odi.take(10)

The output is :

[b',Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,Unnamed: 13',
b'0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426,200*,44.83,21367,86.23,49,96,20,',
b'1,KC Sangakkara (Asia/ICC/SL),2000-2015,404,380,41,14234,169,41.98,18048,78.86,25,93,15,',
b'2,RT Ponting (AUS/ICC),1995-2012,375,365,39,13704,164,42.03,17046,80.39,30,82,20,',
b'3,ST Jayasuriya (Asia/SL),1989-2011,445,433,18,13430,189,32.36,14725,91.2,28,68,34,']

Now when I make a tuple of the same data using the code below:

fields = ["name","country","career","matches","won","loss","ties","toss"]

from collections import namedtuple
Captain = namedtuple("Captain", fields)

def ParseReader(line):
  fields = line.split(",")
  return Captain(fields[0], fields[1], fields[2], fields[3], fields[4], fields[5], fields[6], fields[7])

captains = captain_odi.map(ParseReader)
captains.take(10)

I get the following error when performing the .take() operation:

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input-15-01599f2d7aa3> in <module>()
----> 1 captains.take(10)

3 frames
/content/spark-3.1.2-bin-hadoop3.2/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    326                 raise Py4JJavaError(
    327                     "An error occurred while calling {0}{1}{2}.\n".
--> 328                     format(target_id, ".", name), value)
    329             else:
    330                 raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4.0 failed 1 times, most recent failure: Lost task 0.0 in stage 4.0 (TID 7) (786e4ea62989 executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 604, in main
    process()
  File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/worker.py", line 596, in process
    serializer.dump_stream(out_iter, outfile)
  File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/serializers.py", line 259, in dump_stream
    vs = list(itertools.islice(iterator, batch))
  File "/content/spark-3.1.2-bin-hadoop3.2/python/pyspark/rdd.py", line 1560, in takeUpToNumLeft
    yield next(iterator)
  File "/content/spark-3.1.2-bin-hadoop3.2/python/lib/pyspark.zip/pyspark/util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "<ipython-input-8-f7a02b0b1d17>", line 10, in <lambda>
  File "<ipython-input-8-f7a02b0b1d17>", line 8, in ParseReader
TypeError: a bytes-like object is required, not 'str'

Can someone help me in finding out what is causing this error?

Arihant Kamdar

1 Answer


The error is:

TypeError: a bytes-like object is required, not 'str'

The output is:

b',Player,Span,Mat,Inns,NO,Runs,HS,Ave,BF,SR,100,50,0,Unnamed: 13',

Can you see the little b at the beginning of each line? It means that this "string" is actually not a str but bytes.

What is the difference between a string and a byte string?

Your split(",") call passes a str separator, but each line is bytes (and bytes.split expects a bytes separator). You first need to apply the decode method, using your file's actual encoding (hopefully UTF-8), before splitting.

You need to change this line:

fields = line.split(",")

to this one:

fields = line.decode("utf8").split(",")  # choose the proper encoding - here, utf8
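To illustrate, here is a minimal sketch in plain Python (outside Spark) contrasting the decode-then-split fix with the alternative of splitting the bytes directly using a bytes separator; the sample line mirrors the RDD output shown in the question:

```python
# Sample line, as returned by textFile(..., use_unicode=False)
line = b'0,SR Tendulkar (INDIA),1989-2012,463,452,41,18426'

# Fix 1: decode the bytes into a str, then split on a str separator
decoded_fields = line.decode("utf8").split(",")
print(decoded_fields[1])   # SR Tendulkar (INDIA)

# Fix 2: keep the bytes and split on a *bytes* separator instead
byte_fields = line.split(b",")
print(byte_fields[1])      # b'SR Tendulkar (INDIA)'
```

Note that with Fix 2 every field stays a bytes object, so downstream code must keep working with bytes. Alternatively, leaving use_unicode at its default of True in sc.textFile would give you str lines in the first place, avoiding the decode step entirely.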
Steven