I did some research on this over the past couple of days, and I think I'm close to getting it working, but there are still some issues that I can't quite figure out.
I believe this should work in a Scala environment:
// Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "false").load("../Downloads/*.csv")
spark.read.option("header", "false").csv("../Downloads/*.csv")
That gives me this error: org.apache.spark.sql.AnalysisException: Path does not exist:
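One way to narrow this down is to check what the relative pattern actually resolves to and whether it matches anything. The sketch below is plain Python and only looks at the local disk, so it assumes the files live on the machine where it runs; on a cluster, Spark may resolve the path against a different filesystem. `expand_local_glob` is a hypothetical helper name, not part of Spark. Note also that a `*.csv` pattern will not match files whose names end in `.gz`.

```python
import glob
import os


def expand_local_glob(pattern):
    """Return the absolute form of a glob pattern and the local files it matches."""
    # Resolve "~" and relative parts such as ".." against the current directory.
    absolute = os.path.abspath(os.path.expanduser(pattern))
    return absolute, sorted(glob.glob(absolute))


if __name__ == "__main__":
    absolute, matches = expand_local_glob("../Downloads/*.csv")
    print("pattern resolves to:", absolute)
    print("files matched:", len(matches))
```

If the match count is zero, the pattern or the working directory is the first thing to look at before anything Spark-specific.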
I think this should work in a SQL environment:
df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.load("../Downloads/*.csv") // <-- note the star (*)
df.show()
This gives me a parse exception error.
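One possible cause, assuming the snippet is being run as Python: a line that starts with `.` and a `//` comment are both Scala syntax, and neither is valid Python. The sketch below only checks syntax with the standard-library `ast` module, so `sqlContext` does not need to exist; `parses_as_python` is a hypothetical helper. It does not rule out other causes, such as the cell being interpreted as SQL.

```python
import ast

# The snippet as posted: continuation lines start with "." and the comment uses "//".
AS_POSTED = '''df = sqlContext.read
.format("com.databricks.spark.csv")
.option("header", "false")
.load("../Downloads/*.csv") // <-- note the star (*)
df.show()
'''

# The same method chain wrapped in parentheses, with a Python "#" comment.
AS_PYTHON = '''df = (
    sqlContext.read
    .format("com.databricks.spark.csv")
    .option("header", "false")
    .load("../Downloads/*.csv")  # <-- note the star (*)
)
df.show()
'''


def parses_as_python(source):
    """Return True if the source is syntactically valid Python (nothing is executed)."""
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False


if __name__ == "__main__":
    print("as posted:", parses_as_python(AS_POSTED))
    print("parenthesized:", parses_as_python(AS_PYTHON))
```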
The thing is, these are all .gz-compressed text files, and none of them has a real schema. Each file begins with a vertical list of field names, and the actual data starts at a different row in each file: row 26, 52, 99, 113, 149, and so on. All data is pipe-delimited.
I have the field names, and I have created structured tables in Azure SQL Server, which is where I want to store all the data. I'm really stuck on how to iterate through folders and sub-folders, look for file names that match certain patterns, merge all of the matching files into a dataframe, and then push that dataframe into my SQL Server tables. It seems like a pretty straightforward thing, but I can't seem to get it working!
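To make the goal concrete, here is a minimal sketch of the folder walk and parsing step in plain Python, without Spark. The helper names (`find_files`, `read_rows`, `load_all`) are hypothetical. It also rests on an assumption about the files: the data is taken to begin at the first line that splits into exactly the expected number of pipe-delimited fields, and every non-blank line after that is treated as data. Loading the rows into a dataframe and writing them to SQL Server is not shown.

```python
import fnmatch
import gzip
import os


def find_files(root, pattern):
    """Yield paths under root, including sub-folders, whose file name matches pattern."""
    for folder, subfolders, names in os.walk(root):
        subfolders.sort()  # visit sub-folders in a predictable order
        for name in sorted(names):
            if fnmatch.fnmatch(name, pattern):
                yield os.path.join(folder, name)


def read_rows(path, field_count, delimiter="|"):
    """Yield data rows from one gzipped text file, skipping the leading preamble.

    Assumption: no preamble line splits into exactly field_count fields.
    """
    started = False
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            text = line.rstrip("\r\n")
            fields = text.split(delimiter)
            if not started:
                if len(fields) != field_count:
                    continue  # still inside the field-name list
                started = True
            if text:  # ignore blank lines inside the data
                yield fields


def load_all(root, pattern, field_count):
    """Collect the rows of every matching file under root into one list."""
    rows = []
    for path in find_files(root, pattern):
        rows.extend(read_rows(path, field_count))
    return rows
```

A call such as `load_all("../Downloads", "*.gz", 12)` would then return one list of rows across all matching files; the pattern and the field count of 12 are placeholders, not values from my data.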
I came across the idea here:
https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load