So, the basics are:
- I'm on Spark 2.x
- I'm running this all in a Jupyter notebook
- My goal is to iterate over a number of files in a directory and have Spark (1) create a DataFrame for each file and (2) register those DataFrames as Spark SQL temp tables. Basically, I want to be able to open the notebook at any time and have a clean way of always loading everything available to me.
Below are my imports:
import os
from pyspark.sql.functions import *
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
fileDirectory = 'data/'
Below is the actual code:
for fname in os.listdir(fileDirectory):
    df_app = sqlContext.read.format("csv") \
        .option("header", "true") \
        .option("inferSchema", "true") \
        .load(fname)
    df_app.createOrReplaceTempView(fname)
But I'm getting the following error message:
AnalysisException: u'Unable to infer schema for CSV. It must be specified manually.;'
It would appear that the loop over the files itself is fine (great), but Spark won't infer the schemas. When I load each file manually, this has never been an issue.
Can someone give me some pointers on where I can improve this / how to get it running?
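For what it's worth, my suspicion is that `load(fname)` only receives the bare file name (since `os.listdir` doesn't include the directory), and that the `.csv` extension would also make an awkward temp-view name. Here's a minimal sketch of the two helpers I think I need — the function names are just placeholders of mine, and the Spark part of the loop is shown as a comment since it needs a live `SparkContext`:

```python
import os

fileDirectory = 'data/'

def full_path_for(directory, fname):
    # os.listdir returns bare names like 'apps.csv';
    # join them back onto the directory so Spark can find them
    return os.path.join(directory, fname)

def view_name_for(fname):
    # Strip the extension so the view name has no dot:
    # 'apps.csv' -> 'apps'
    return os.path.splitext(os.path.basename(fname))[0]

# In the notebook the loop would then become:
#
# for fname in os.listdir(fileDirectory):
#     df = sqlContext.read.format("csv") \
#         .option("header", "true") \
#         .option("inferSchema", "true") \
#         .load(full_path_for(fileDirectory, fname))
#     df.createOrReplaceTempView(view_name_for(fname))
```

Does that look like the right direction, or is there a more idiomatic way to do this in Spark 2.x?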
Many, many thanks!