I have to process millions of JSON files from Azure Blob Storage, each representing one row, and need to load them into Azure SQL DB with some minimal transformation in between. These files come in at random times but follow the same schema.
My first solution simply created a DataFrame for each file and pushed it into SQL. That worked when we were receiving hundreds of files, but now that we are receiving millions it does not scale: a full run takes over a day.
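Roughly, that first approach looked like the sketch below (simplified; fileList, transformContact, and the JDBC details are placeholders standing in for our real code, not the exact implementation):

import org.apache.spark.sql.{DataFrame, SparkSession, SaveMode}

val spark = SparkSession.builder().appName("json-to-sql").getOrCreate()

val jdbcUrl = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"  // placeholder
val jdbcProps = new java.util.Properties()
jdbcProps.setProperty("user", "<user>")          // placeholder
jdbcProps.setProperty("password", "<password>")  // placeholder

// one tiny DataFrame and one JDBC round trip per file; this is the part that does not scale
fileList.foreach { path =>                       // fileList: paths of the individual JSON blobs
  val df: DataFrame = spark.read.json(path)
  val transformed = transformContact(df)         // minimal column mapping/renaming
  transformed.write.mode(SaveMode.Append).jdbc(jdbcUrl, "dbo.Contact_VS", jdbcProps)
}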
We also tried processing the files in Scala without Spark (the code just below this paragraph), but that is also too slow: roughly 500 files processed in 8 minutes.
// mapper is a Jackson ObjectMapper; statement/connection come from an existing JDBC connection
val sql_statement = allFiles.par.map(file_name => {
  // processing
  val json = scala.io.Source.fromFile(file_name).mkString
  val mapData1 = mapper.readValue(json, classOf[Map[String, Any]])
  val account = mapData1("Contact").asInstanceOf[Map[String, Any]]
  // keep only the source fields that have a destination column
  val common = account.keys.toList.intersect(srcDestMap.keys.toList)
  val trMap = common.map(rec => Map(srcDestMap(rec) -> account(rec))).flatten.toMap
  val vals = trMap.keys.toList.sorted.map(trMap(_).toString.replace("'", "''")).map("'" + _ + "'")
  // end processing
  "insert into dbo.Contact_VS(" + trMap.keys.toList.sorted.mkString(",") + ") values (" + vals.mkString(",") + ")"
}).mkString(";")
val updated = statement.executeUpdate(sql_statement)
connection.close()
If anyone knows how to optimize this code, or has any out-of-the-box ideas for preprocessing our JSON, it would be greatly appreciated! The JSON is nested, so merging everything into one large JSON to be read into Spark is a little more involved, but we may have to go that way if no one has a better idea (a rough sketch of what we mean is below).
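For reference, the bulk-read direction we have in mind would look something like this sketch, reading all the blobs in one pass rather than literally concatenating them first. It is only a sketch: the abfss path, the Contact field names, and the JDBC details are placeholders, not our real schema or connection.

import org.apache.spark.sql.{SparkSession, SaveMode}
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().appName("bulk-json-to-sql").getOrCreate()

// read every blob in one pass instead of one read per file (placeholder path)
val raw = spark.read.json("abfss://<container>@<account>.dfs.core.windows.net/contacts/*.json")

// flatten the nested Contact object into top-level columns (illustrative field names)
val flat = raw.select(
  col("Contact.FirstName").as("FirstName"),
  col("Contact.LastName").as("LastName"),
  col("Contact.Email").as("Email")
)

// single bulk JDBC write instead of millions of single-row inserts (placeholder connection details)
val jdbcUrl = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"
val jdbcProps = new java.util.Properties()
jdbcProps.setProperty("user", "<user>")
jdbcProps.setProperty("password", "<password>")
flat.write.mode(SaveMode.Append).jdbc(jdbcUrl, "dbo.Contact_VS", jdbcProps)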