How do I read data from HDFS using Scala? The data is a CSV file with a limited number of records.

Cœur
Sunitha

2 Answers

You tagged the question with Spark, so I'm assuming you are trying to use that. I would recommend you start by reading through the Spark documentation here to get an idea of how to use Spark to interact with your data.

https://spark.apache.org/docs/latest/quick-start.html

https://spark.apache.org/docs/latest/sql-programming-guide.html

But, to answer your specific question, in Spark you would read in the CSV file using code like this:

// Assumes an existing SparkSession named `spark`, as provided by spark-shell
val csvDf = spark.read.format("csv")
  .option("sep", ",")
  .option("header", "true")
  .load("hdfs://some/path/to/data.csv/")

The path you provide can point to a single CSV file on HDFS or to a folder containing multiple CSV files. Spark also accepts other file systems: for example, you can use "file://" to access the local file system or "s3://" to use S3. Once the data is loaded, you have a Spark DataFrame with SQL-like methods for working with it.
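To give a rough idea of what working with the loaded DataFrame can look like, here is a minimal sketch. The helper names preview and firstRowsViaSql and the view name csv_data are made up for illustration; they are not part of Spark.

```scala
import org.apache.spark.sql.DataFrame

// Hypothetical helper: print the column layout and the first few rows,
// then return the total number of rows.
def preview(df: DataFrame, rows: Int = 5): Long = {
  df.printSchema()
  df.show(rows)
  df.count()
}

// Hypothetical helper: query the data with plain SQL through a temporary view.
// The view name "csv_data" is an arbitrary choice.
def firstRowsViaSql(df: DataFrame, limit: Int): DataFrame = {
  df.createOrReplaceTempView("csv_data")
  df.sparkSession.sql(s"SELECT * FROM csv_data LIMIT $limit")
}
```

With the DataFrame from the snippet above, you could call preview(csvDf) or firstRowsViaSql(csvDf, 10).show().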

Note that I set the separator option in the read call above only to show how it is done; it defaults to "," anyway, so it is not required. Also, if your CSV files do not include a header, set header to false; Spark then assigns default column names (_c0, _c1, and so on) unless you specify the schema yourself.
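As a minimal sketch of the no-header case, assuming a hypothetical three-column file: the column names id, name, and amount, their types, and the helper name readWithoutHeader are all illustrative and need to be adapted to your data.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types.{DoubleType, IntegerType, StringType, StructField, StructType}

// Hypothetical schema: adjust the names and types to match your file.
val schema = StructType(Seq(
  StructField("id", IntegerType, nullable = true),
  StructField("name", StringType, nullable = true),
  StructField("amount", DoubleType, nullable = true)
))

// Hypothetical helper: read a header-less CSV using the explicit schema above.
def readWithoutHeader(spark: SparkSession, path: String): DataFrame =
  spark.read.format("csv")
    .option("header", "false")
    .schema(schema)
    .load(path)
```

In spark-shell, where a session named spark already exists, this could be called as readWithoutHeader(spark, "hdfs://some/path/to/data.csv/").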

Ryan Widmaier

You can read data from HDFS directly with the Hadoop FileSystem API, like this:

import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val hdfs = FileSystem.get(new URI("hdfs://hdfsUrl:port/"), new Configuration())
val path = new Path("/pathOfTheFileInHDFS/")
val stream = hdfs.open(path)
def readLines = Stream.cons(stream.readLine, Stream.continually(stream.readLine))

// Check each line for null (end of file) and print every existing line in order
readLines.takeWhile(_ != null).foreach(line => println(line))
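A variant of the snippet above, sketched with scala.io.Source so that the stream is closed once reading finishes. The helper name readHdfsLines and its parameters are illustrative, and the sketch assumes the file is small enough to hold in memory and is UTF-8 encoded.

```scala
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.io.Source

// Hypothetical helper: read every line of a text file into a List.
// Only suitable for small files, such as a CSV with limited records.
def readHdfsLines(fsUri: String, filePath: String): List[String] = {
  val fs = FileSystem.get(new URI(fsUri), new Configuration())
  val stream = fs.open(new Path(filePath))
  // Materialize the lines before the stream is closed in the finally block
  try Source.fromInputStream(stream, "UTF-8").getLines().toList
  finally stream.close()
}
```

For example, readHdfsLines("hdfs://hdfsUrl:port/", "/pathOfTheFileInHDFS/").foreach(println) prints each line, using the same placeholder URI and path as above.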

Also, please have a look at this article: https://blog.matthewrathbone.com/2013/12/28/reading-data-from-hdfs-even-if-it-is-compressed

Please let me know if this answers your question.

Chaitanya