
I did some research on this during the past couple days and I think I'm close to getting this working, but there are still some issues that I can't quite figure out.

I believe this should work in a Scala environment:

// Spark 2.0
// these lines are equivalent in Spark 2.0
spark.read.format("csv").option("header", "false").load("../Downloads/*.csv")
spark.read.option("header", "false").csv("../Downloads/*.csv")

That gives me this error: org.apache.spark.sql.AnalysisException: Path does not exist:

I think this should work in a SQL environment:

val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "false")
  .load("../Downloads/*.csv") // <-- note the star (*)
df.show()

This gives me a parse exception error.

The thing is, these are all .gz compressed text files, and there is really no schema in any of them. Well, there is a vertical list of field names at the top, but the real data always starts on some arbitrary row like 26, 52, 99, 113, or 149. All data is pipe-delimited. I have the field names, and I created structured tables in Azure SQL Server, which is where I want to store everything.

I'm really stuck on how to iterate through folders and sub-folders, look for file names that match certain patterns, merge all of these into a single dataframe, and then push that object into my SQL Server tables (roughly the flow sketched below). It seems like a pretty straightforward thing, but I can't seem to get this darn thing working!!
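To make those last steps concrete, this is roughly what I have in mind. The mount path, server, database, table name, and credentials below are just placeholders, and the Microsoft SQL Server JDBC driver would need to be available on the cluster:

// Merge every pipe-delimited .gz file that matches the pattern,
// walking the year/month/day sub-folders with wildcards (placeholder path).
val merged = spark.read.format("csv")
  .option("sep", "|")
  .option("header", "false")
  .option("inferSchema", "true")
  .load("/mnt/rawdata/*/*/*/client/ABC*.gz")

// Push the merged dataframe into an existing Azure SQL Server table via JDBC.
val jdbcUrl = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"
val connProps = new java.util.Properties()
connProps.setProperty("user", "myuser")
connProps.setProperty("password", "mypassword")
connProps.setProperty("driver", "com.microsoft.sqlserver.jdbc.SQLServerDriver")

merged.write
  .mode("append")
  .jdbc(jdbcUrl, "dbo.MyTargetTable", connProps)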

I came across the idea here:

https://stackoverflow.com/questions/37639956/how-to-import-multiple-csv-files-in-a-single-load

ASH

2 Answers


You can find all the files with pure Scala and then pass them to Spark:

import java.io.File

val file = new File(yourDirectory)
val files: List[String] = file.listFiles
  .filter(_.isFile)
  .filter(_.getName.startsWith("yourCondition"))
  .map(_.getPath)
  .toList

val df = spark.read.csv(files: _*)
chlebek
  • I was afraid of this. I got an error on the very first line. It looks like this: val file = new File("/rawdata/2019/01/01/client/") The error message is: notebook:3: error: not found: type File val file = new File("/rawdata/2019/01/01/client/") – ASH Oct 09 '19 at 01:21
  • It almost seems like there is some kind of firewall blocking this. I've tried many versions of sample code that I found online. Absolutely nothing works for me. It either can't mount files, or it can't find files, or it doesn't recognize the path to the Lake, or some such thing. – ASH Oct 09 '19 at 01:23
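One likely reason the java.io.File approach still fails in a notebook: it only sees the driver's local filesystem, not the data lake. Assuming a Databricks setup with the lake mounted under /mnt/rawdata (the same mount used in the answer below), a sketch of the same filter-then-load idea using dbutils could look like this:

// List the mounted folder through DBFS instead of java.io.File,
// then keep only the files whose names match the pattern.
// The mount point /mnt/rawdata/... is an assumption.
val matching: Seq[String] = dbutils.fs.ls("/mnt/rawdata/2019/01/01/client/")
  .filter(_.name.startsWith("ABC"))
  .map(_.path)

val df = spark.read
  .option("sep", "|")
  .option("header", "false")
  .csv(matching: _*)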

I finally, finally, finally got this working.

val myDFCsv = spark.read.format("csv")
  .option("sep", "|")             // files are pipe-delimited
  .option("inferSchema", "true")
  .option("header", "false")
  .load("mnt/rawdata/2019/01/01/client/ABC*.gz") // wildcard picks up every ABC*.gz in the mounted folder

myDFCsv.show()
myDFCsv.count()
ASH