What is the best way to read a CSV-formatted result from a REST API directly into Spark?

Basically, I have the following, which I know I can process in Scala and save to a file, but I would like to process the data in Spark:

val resultCsv = scala.io.Source.fromURL(url).getLines()
horatio1701d
  • You have to read the file into memory/disk before Spark can do anything with it, so what you have is the only option. Whether you put that in a Spark executor is up to you – OneCricketeer Jul 07 '17 at 02:03
  • Thanks. I'm okay with putting it into memory instead of disk. Is there an idiomatic way to turn the returned CSV lines into an RDD/DF that you could recommend? – horatio1701d Jul 07 '17 at 02:13

1 Answer

This is how it can be done.

For Spark 2.2.x

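import scala.io.Source
import org.apache.spark.sql.{Dataset, SparkSession}

// Assumes an existing SparkSession named `spark` (as in spark-shell) and a `url` string.
import spark.implicits._ // provides the Encoder[String] that toDS() needs

// Fetch the whole response on the driver, one element per CSV line.
val res = Source.fromURL(url).getLines().toList
val csvData: Dataset[String] = spark.sparkContext.parallelize(res).toDS()

// DataFrameReader.csv accepts a Dataset[String] as of Spark 2.2.0.
val frame = spark.read.option("header", true).option("inferSchema", true).csv(csvData)
frame.printSchema()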

Using the Databricks spark-csv library for older versions of Spark

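import scala.io.Source
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.DataFrame
import com.databricks.spark.csv.CsvParser

// Assumes an existing SQLContext named `sqlContext` and a `url` string.
val res = Source.fromURL(url).getLines().toList
val csvData: RDD[String] = sqlContext.sparkContext.parallelize(res)

val csvParser = new CsvParser()
  .withUseHeader(true)
  .withInferSchema(true)

// csvRdd parses an RDD of CSV lines into a DataFrame.
val frame: DataFrame = csvParser.csvRdd(sqlContext, csvData)
frame.printSchema()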
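
Both snippets read from the URL without ever closing the underlying stream. A minimal sketch of one way to handle that, using only the Scala standard library; the object and method names (`CsvFetch`, `fetchLines`) are hypothetical and not part of the answer:

```scala
import scala.io.Source

object CsvFetch {
  // Hypothetical helper (not from the answer): read every line from `url`
  // and close the underlying stream even if reading fails.
  def fetchLines(url: String): List[String] = {
    val source = Source.fromURL(url)
    try source.getLines().toList // toList forces the full read before close()
    finally source.close()
  }
}
```

The returned list could then be passed to `parallelize` in place of `res` in either snippet.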

Note: I am new to Scala, so any improvements will be appreciated.

ref: here

batman
  • How do you consume results from a REST web service that requires authentication? For example, a ServiceNow table with 1,000,000 records needs to be pulled, and each API call to ServiceNow returns 1,000 results, so getting the entire dataset takes multiple calls. I am trying to create a DataSourceV2, but after looking at it, there isn't much documentation. Do you have any suggestions? – pitchblack408 May 14 '19 at 17:51
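
One possible shape for the multi-call case, as a rough sketch rather than a tested ServiceNow integration: keep requesting pages until a short page comes back, then hand the combined lines to Spark as in the answer. Everything here is hypothetical (`PagedFetch`, `fetchAll`, and the `fetchPage(offset, limit)` function, which stands in for one authenticated REST call); the real endpoint, its paging parameters, and its authentication scheme are not specified in the thread.

```scala
import scala.annotation.tailrec

object PagedFetch {
  // Hypothetical helper: `fetchPage(offset, limit)` stands in for one
  // authenticated REST call returning the CSV data rows of that page.
  def fetchAll(fetchPage: (Int, Int) => List[String], pageSize: Int): List[String] = {
    @tailrec
    def loop(offset: Int, acc: Vector[String]): Vector[String] = {
      val page = fetchPage(offset, pageSize)
      val next = acc ++ page
      // A short (or empty) page is taken to mean there are no more rows.
      if (page.size < pageSize) next else loop(offset + pageSize, next)
    }
    loop(0, Vector.empty).toList
  }
}
```

Note that this collects every row on the driver, just as the answer's single-call version does, so it only suits datasets that fit in driver memory.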