Process CSV from REST API into Spark

Question

What is the best way to read a CSV-formatted result from a REST API directly into Spark?

I have the following, which I know I can process in Scala and save to a file, but I would like to process the data in Spark:

val resultCsv = scala.io.Source.fromURL(url).getLines()

You have to read the file into memory/disk before Spark can do anything with it, so what you have is the only option. Whether you put that in a Spark executor is up to you — OneCricketeer, Jul 07 '17 at 02:03
Thanks. I'm okay with putting it into memory instead of disk. Is there an idiomatic transition from the csv object returned into an RDD/DF you could recommend? — horatio1701d, Jul 07 '17 at 02:13

score 3 · Accepted Answer · answered Mar 25 '19 at 07:19

Read the response into memory as a list of lines, parallelize it, and hand the result to a CSV parser.

For Spark 2.2.x

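import scala.io.Source.fromURL
import org.apache.spark.sql.{Dataset, SparkSession}

// `spark` is the active SparkSession; its implicits provide toDS()
import spark.implicits._

// Read the whole response on the driver, one list element per line
val res = fromURL(url).getLines().toList
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()

// Treat the first line as the header and infer the column types
val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
frame.printSchema()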
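
One detail the snippet above leaves open is that the Source returned by fromURL is never closed. A minimal sketch of a helper that reads the response and then closes the source is shown here; the name RestCsv.fetchLines is hypothetical, and it assumes the API returns UTF-8 text that fits in driver memory:

```scala
import scala.io.{Codec, Source}

object RestCsv {
  // Hypothetical helper: downloads the body at `url` and returns its lines.
  // Assumes a UTF-8 response that is small enough to hold in driver memory.
  def fetchLines(url: String): List[String] = {
    val source = Source.fromURL(url)(Codec.UTF8)
    try source.getLines().toList // materialize the lines before closing
    finally source.close()
  }
}
```

The returned list can be passed to spark.sparkContext.parallelize in place of res.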

Using the Databricks spark-csv library for older versions of Spark:

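import scala.io.Source.fromURL
import com.databricks.spark.csv.CsvParser
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame

// `sqlContext` is the existing SQLContext (predefined in spark-shell)
val res = fromURL(url).getLines().toList
val rdd: RDD[String] = sqlContext.sparkContext.parallelize(res)

// Treat the first line as the header and infer the column types
val csvParser = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)

val frame: DataFrame = csvParser.csvRdd(sqlContext, rdd)
frame.printSchema()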

Note: I am new to Scala, so any improvements are appreciated.

ref: here

How do you consume results from a REST web service that requires authentication? For example, a ServiceNow table with 1,000,000 records needs to be pulled, and each API call to ServiceNow returns 1,000 results, so getting the entire dataset takes multiple calls. I am trying to create a DataSourceV2, but after looking into it, there isn't much documentation. Do you have any suggestions? — pitchblack408, May 14 '19 at 17:51
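
For the paginated, authenticated case raised in the comment above, one possible approach is to loop over page offsets on the driver and combine the rows before handing them to Spark. The sketch below is only an illustration under stated assumptions: the names PagedCsv, basicAuth, fetchPage and fetchAll are hypothetical, HTTP Basic authentication is assumed, and the query parameters for offset and page size depend on the API and are left to the caller.

```scala
import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets
import java.util.Base64
import scala.annotation.tailrec
import scala.io.Source

object PagedCsv {
  // Builds an HTTP Basic `Authorization` header value (assumes Basic auth is accepted).
  def basicAuth(user: String, password: String): String = {
    val raw = s"$user:$password".getBytes(StandardCharsets.UTF_8)
    "Basic " + Base64.getEncoder.encodeToString(raw)
  }

  // Fetches one page as lines of text, sending the given Authorization header.
  def fetchPage(url: String, authHeader: String): List[String] = {
    val conn = new URL(url).openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestProperty("Authorization", authHeader)
    val source = Source.fromInputStream(conn.getInputStream, "UTF-8")
    try source.getLines().toList
    finally { source.close(); conn.disconnect() }
  }

  // Calls `page(offset)` with offsets 0, pageSize, 2 * pageSize, ... and stops
  // after the first page that returns fewer than `pageSize` rows.
  // `page` is expected to return data rows only (no repeated header line).
  def fetchAll(pageSize: Int)(page: Int => List[String]): List[String] = {
    @tailrec
    def loop(offset: Int, acc: Vector[String]): Vector[String] = {
      val rows = page(offset)
      val next = acc ++ rows
      if (rows.size < pageSize) next else loop(offset + pageSize, next)
    }
    loop(0, Vector.empty).toList
  }
}
```

A caller would supply a function that builds the page URL for a given offset and calls fetchPage with the header from basicAuth. If the API repeats the CSV header on every page, that function should drop it from all pages but the first. The combined list can then be parallelized as in the answer.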
