I am going through a PySpark tutorial and have the following code:

%%pyspark
people = spark.read.option("header", True).option("inferSchema",True).csv("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")

This gives the following error:

AnalysisException: Unable to infer schema for CSV. It must be specified manually.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Unable to infer schema for CSV. It must be specified manually.

So I tried specifying the schema using the code below. This does not error out, but returns an empty dataframe with the specified schema structure:

%%pyspark

from pyspark.sql.types import StructType, StructField, IntegerType, StringType

# Specify the schema explicitly (every column is read as a string)
custSchema = StructType([
    StructField('userID', StringType(), True),
    StructField('name', StringType(), True),
    StructField('age', StringType(), True),
    StructField('friends', StringType(), True),
])

people = spark.read.format("csv").option("delimiter",",").option("header", "true").schema(custSchema).load("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")

I know the file itself is fine, because I can read it into an RDD with the following block of code:

%%pyspark

lines = sc.textFile("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")

The file is a simple CSV file that looks like this:

userID,name,age,friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307
8,Jadzia,38,380
9,Hugh,27,181
10,Odo,53,191
11,Ben,57,372
12,Keiko,54,253
13,Jean-Luc,56,444
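
As a quick sanity check on the file content itself, independent of Spark, the sample can be parsed with Python's standard csv module. This is only a minimal sketch: it assumes the first few rows shown above, pasted into an in-memory string, are representative of the whole file, and the variable names are illustrative.

```python
import csv
import io

# First rows of the sample shown above, held in memory for the check.
sample = """userID,name,age,friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
"""

# Parse with a comma delimiter (also the default for csv.reader).
rows = list(csv.reader(io.StringIO(sample), delimiter=","))
header, data = rows[0], rows[1:]

# Collect the distinct number of fields per row; a clean file yields one value.
field_counts = {len(row) for row in rows}

print(header)
print(field_counts)
```

If every row yields four fields, the content is plain comma-separated text with a header row, which suggests the problem lies in how the path is read rather than in the data.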

Note: I also tried reading it with spark.read.text; again it does not raise an error, but no rows are read from the file. Somehow I can read the file into an RDD but not into a DataFrame. I tried a different file and hit the same issue.
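
One way to make that discrepancy concrete is to count what each API returns for the same path. The helper below is a hypothetical sketch (the name compare_read_paths is not from the tutorial); it assumes it is given the notebook's existing spark session and only uses sparkContext.textFile, read.text and count.

```python
def compare_read_paths(spark, path):
    """Count lines via the RDD API and rows via the DataFrame reader for one path."""
    # RDD route: the one that works in the question.
    rdd_lines = spark.sparkContext.textFile(path).count()
    # DataFrame route: the one that comes back empty in the question.
    df_rows = spark.read.text(path).count()
    return {"rdd_lines": rdd_lines, "df_rows": df_rows}
```

In the notebook this would be called as compare_read_paths(spark, "abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv"). A non-zero rdd_lines next to a zero df_rows would confirm that the two readers resolve the same path differently.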

What could I be missing?

Thank you

Vidya

1 Answer

You need to specify the delimiter when reading a CSV file; in your case it is a comma (,):

%%pyspark
people = spark.read.option("header", True).option("delimiter", ",").option("inferSchema",True).csv("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")

  • I tried adding a delimiter option as follows: people = spark.read.option("header", True).option("delimiter", ",").option("inferSchema",True).csv("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv") Still getting the same error. – Vid Jan 26 '22 at 19:44