I am trying to read a CSV file into a DataFrame (https://databricks.com/blog/2015/02/17/introducing-dataframes-in-spark-for-large-scale-data-science.html)

Using:

spark-1.3.1-bin-hadoop2.6
spark-csv_2.11-1.1.0

Code:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object test {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("test")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    val df = sqlContext.csvFile("filename.csv")
    ...
  }
}

Error:

value csvFile is not a member of org.apache.spark.sql.SQLContext

I was trying to do as advised here: Spark - load CSV file as DataFrame?

But sqlContext doesn't seem to recognize the csvFile method of the CsvContext class.

Any advice would be appreciated!


1 Answer

I am also facing some issues with CSV (without spark-csv), but here are some things you can look at and check that they are OK:

  1. Build the Spark shell with the spark-csv library using sbt assembly.
  2. Add the spark-csv dependency to the POM.XML of your Maven project.
  3. Use the load/save methods of the DataFrame API (see the sketch after this list).
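A minimal sketch of option 3 with the Spark 1.3 API, assuming the spark-csv package is on the classpath (the file name and options here are just examples):

import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

// Name the data source explicitly; without it, load() defaults to Parquet,
// which is why reading a CSV path directly fails.
val df = sqlContext.load(
  "com.databricks.spark.csv",
  Map("path" -> "filename.csv", "header" -> "true"))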

SPARK-CSV GITHUB

Refer to the spark-csv GitHub readme.md page and you will be up and running :)
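For the csvFile style from your question, the readme's usage is roughly the following sketch (from memory, so double-check it there; "cars.csv" is just an example file). The implicit import is the piece that is easy to miss:

import org.apache.spark.sql.SQLContext
import com.databricks.spark.csv._  // brings csvFile into scope on SQLContext

val sqlContext = new SQLContext(sc)
val df = sqlContext.csvFile("cars.csv")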

  • **1&2.** I already have **"com.databricks" % "spark-csv_2.11" % "1.1.0"** in my sbt, and when compiling it does not complain about missing dependencies. **3.** How will the load/save methods help me read the CSV? As for the readme.md page, I have already read it; it advises using sqlContext.load, which compiles but fails at runtime with "filename.csv" is not a Parquet file. That is why I had followed the advice from here: http://stackoverflow.com/questions/29704333/how-to-read-csv-file-as-dataframe – user3376961 Jun 16 '15 at 18:43
  • If you want to save the CSV as Parquet, then use the DataFrame created: HashMap options = new HashMap(); options.put("header", "true"); options.put("path", dataFile); DataFrame df = sqlContext.load("com.databricks.spark.csv", options); df.save(, "parquet"); – Chetandalal Jun 16 '15 at 19:05
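The comment's snippet is Java-flavored and leaves the first argument of save blank; in Scala the same idea looks roughly like this (the output path is illustrative, since the original omits it):

// Load the CSV through the spark-csv data source ...
val options = Map("header" -> "true", "path" -> "filename.csv")
val df = sqlContext.load("com.databricks.spark.csv", options)

// ... then persist it as Parquet via Spark 1.3's DataFrame.save(path, source).
df.save("output.parquet", "parquet")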