I am going through a PySpark tutorial and have the following code:
%%pyspark
people = spark.read.option("header", True).option("inferSchema",True).csv("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")
This gives the below error:
AnalysisException: Unable to infer schema for CSV. It must be specified manually.
Traceback (most recent call last):
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 737, in csv
    return self._df(self._jreader.csv(self._spark._sc._jvm.PythonUtils.toSeq(path)))
  File "/home/trusted-service-user/cluster-env/env/lib/python3.8/site-packages/py4j/java_gateway.py", line 1304, in __call__
    return_value = get_return_value(
  File "/opt/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 117, in deco
    raise converted from None
pyspark.sql.utils.AnalysisException: Unable to infer schema for CSV. It must be specified manually.
So I tried specifying the schema using the code below. This does not error out, but returns an empty dataframe with the specified schema structure:
%%pyspark
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
# Explicit schema: every column is read as a nullable string
custSchema = StructType([
    StructField('userID', StringType(), True),
    StructField('name', StringType(), True),
    StructField('age', StringType(), True),
    StructField('friends', StringType(), True),
])
people = spark.read.format("csv").option("delimiter",",").option("header", "true").schema(custSchema).load("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")
Output is an empty dataframe: the four column headers appear, but there are no rows.
I know there is no issue with the file, since I can read it into an RDD with the following block of code:
%%pyspark
lines = sc.textFile("abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net/pysparkCourse/fakefriends-header.csv")
The file is a simple CSV file that looks like below:
userID,name,age,friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
3,Deanna,40,465
4,Quark,68,21
5,Weyoun,59,318
6,Gowron,37,220
7,Will,54,307
8,Jadzia,38,380
9,Hugh,27,181
10,Odo,53,191
11,Ben,57,372
12,Keiko,54,253
13,Jean-Luc,56,444
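As a sanity check outside Spark, the listing above can be parsed with Python's standard csv module. This is only a minimal sketch: the SAMPLE string is copied by hand from the first rows shown above (it is not read from the storage account), and parse_sample is a hypothetical helper name. It only shows that the content is well-formed CSV with a consistent number of fields per row.

```python
import csv
import io

# First rows of fakefriends-header.csv, copied by hand from the listing above
# (an assumption: this is not read from the actual storage account).
SAMPLE = """userID,name,age,friends
0,Will,33,385
1,Jean-Luc,26,2
2,Hugh,55,221
"""


def parse_sample(text):
    """Return (header, rows) parsed from CSV text; blank lines are skipped."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = [row for row in reader if row]
    return header, rows


header, rows = parse_sample(SAMPLE)
# Every data row should have as many fields as the header.
consistent = all(len(row) == len(header) for row in rows)
print(header, len(rows), consistent)
```

If this passes on the real file as well, the content itself is probably not what stops the dataframe reader from seeing any rows.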
Note: I also tried reading the file with spark.read.text. That does not raise an error either, but no rows are read. So I can read the file into an RDD but not into a dataframe. I tried a different file and hit the same issue.
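One thing I have not ruled out is the path itself. As far as I understand, the documented form for the abfss scheme is abfss://<container>@<account>.dfs.core.windows.net/<path>, while my path uses the blob.core.windows.net endpoint. Below is a minimal sketch, using only the standard library, that splits the URI into its parts and flags which endpoint it uses. The helper name describe_abfs_uri is hypothetical, and whether the endpoint actually explains the empty dataframe is an assumption, not something I have confirmed.

```python
from urllib.parse import urlparse


def describe_abfs_uri(uri):
    """Split an abfs/abfss-style URI into scheme, container, host and path."""
    parsed = urlparse(uri)
    # The authority part has the form <container>@<account host>.
    container, _, host = parsed.netloc.partition("@")
    return {
        "scheme": parsed.scheme,
        "container": container,
        "host": host,
        "path": parsed.path,
        # Assumption: the ABFS driver expects the dfs endpoint, not blob.
        "uses_dfs_endpoint": host.endswith(".dfs.core.windows.net"),
    }


info = describe_abfs_uri(
    "abfss://user-797427@strprmsandboxpoc001.blob.core.windows.net"
    "/pysparkCourse/fakefriends-header.csv"
)
print(info)
```

For my path, uses_dfs_endpoint comes out as False, so the scheme and the endpoint may not match.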
What could I be missing?
Thank you
Vidya