
I was using the Scala JSON library to parse a JSON file from the local drive in a Spark job:

val requestJson = JSON.parseFull(Source.fromFile("c:/data/request.json").mkString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]

But when I try to use the same parser by pointing it to the HDFS file location, it doesn't work:

val requestJson = JSON.parseFull(Source.fromFile("hdfs://url/user/request.json").mkString)

and gives me an error:

java.io.FileNotFoundException: hdfs:/localhost/user/request.json (No such file or directory)
  at java.io.FileInputStream.open0(Native Method)
  at java.io.FileInputStream.open(FileInputStream.java:195)
  at java.io.FileInputStream.<init>(FileInputStream.java:138)
  at scala.io.Source$.fromFile(Source.scala:91)
  at scala.io.Source$.fromFile(Source.scala:76)
  at scala.io.Source$.fromFile(Source.scala:54)
  ... 128 elided

How can I use the JSON.parseFull library to get data from an HDFS file location?

Thanks

baiduXiu
  • You should provide the `hdfs` location like this: `hdfs://cluster_name/path/to/file`, or simply give the directory name like `/path/to/file/`. Please try and let me know, I will answer accordingly. – Sandeep Singh Jan 04 '17 at 03:29
  • Yeah, I tried giving the HDFS path to the Source.fromFile API but it doesn't work. – baiduXiu Jan 04 '17 at 03:52
  • Could you post the error log? – Sandeep Singh Jan 04 '17 at 03:59
  • java.io.FileNotFoundException: hdfs:/hdfsurl/user/request.json (No such file or directory) at java.io.FileInputStream.open0(Native Method) at java.io.FileInputStream.open(FileInputStream.java:195) at java.io.FileInputStream.(FileInputStream.java:138) at scala.io.Source$.fromFile(Source.scala:91) at scala.io.Source$.fromFile(Source.scala:76) at scala.io.Source$.fromFile(Source.scala:54) ... 128 elided – baiduXiu Jan 04 '17 at 04:44

3 Answers


Spark has built-in support for parsing JSON documents, which is available in the spark-sql_${scala.version} jar.

In Spark 2.0+:

import org.apache.spark.sql.SparkSession

val spark: SparkSession = SparkSession.builder.master("local").getOrCreate

// read the JSON file from HDFS into a DataFrame
val df = spark.read.json("json/file/location/in/hdfs")

df.show()

With the df object you can do all supported SQL operations on it, and its data processing will be distributed among the nodes, whereas requestJson would be computed on a single machine only.
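
For example, assuming the file keeps the nested Request/currency layout from the question (the column name below is that assumption), the field can be pulled back to the driver like this:

// "Request.currency" assumes the JSON layout shown in the question
// the file is tiny, so collecting the single row back to the driver is cheap
val currency = df.select("Request.currency").collect().head.getString(0)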

Maven dependencies

<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-sql_2.11</artifactId>
    <version>2.0.0</version>
</dependency>

Edit (as per the comments, to read the file from HDFS):

import java.io.{BufferedReader, InputStreamReader}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// get a handle on HDFS (adjust the namenode URI for your cluster)
val hdfs = FileSystem.get(
             new java.net.URI("hdfs://ITS-Hadoop10:9000/"),
             new Configuration()
           )
// "x" is a directory-name variable carried over from the snippet this was copied from
val path = new Path("/user/zhc/" + x + "/")
val t = hdfs.listStatus(path)           // list the files under that directory
val in = hdfs.open(t(0).getPath)        // open the first file as an InputStream
val reader = new BufferedReader(new InputStreamReader(in))
var l = reader.readLine()               // read it line by line

code credits: from another SO question

Maven dependencies:

<dependency>
    <groupId>org.apache.hadoop</groupId>
    <artifactId>hadoop-hdfs</artifactId>
    <version>2.7.2</version> <!-- you can change this as per your hadoop version -->
</dependency>
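
To tie this back to the original question, the same FileSystem API can feed JSON.parseFull directly. Here is a minimal sketch, where "hdfs://url" and "/user/request.json" are the placeholders from the question and the parsing lines are the OP's own code:

import scala.io.Source
import scala.util.parsing.json.JSON
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// open the file straight from HDFS instead of Source.fromFile
val fs = FileSystem.get(new java.net.URI("hdfs://url"), new Configuration())
val stream = fs.open(new Path("/user/request.json"))
val jsonString = Source.fromInputStream(stream).mkString
stream.close()

// the parsing from the question then works unchanged
val requestJson = JSON.parseFull(jsonString)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]
val currency = mainJson.get("currency").get.asInstanceOf[String]
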
mrsrinivas
  • The JSON file is only a few KBs, so I want to avoid using a DataFrame in this case and parse the JSON on the driver rather than on all the workers. – baiduXiu Jan 04 '17 at 03:51
  • We can restrict the workers to one by changing the code to `master("local[1]")`. As you are running in `local` mode, the worker and driver will be on the same machine. – mrsrinivas Jan 04 '17 at 03:57
  • You can use `df.collect()` to get the entire data to the driver. – mrsrinivas Jan 04 '17 at 03:58
  • The job parses a small file and then a bigger Parquet file, and joins the bigger file with the smaller one. I meant that the smaller file doesn't need to run distributed code; the job should still execute on all the available cores since it has to process the large Parquet file. – baiduXiu Jan 04 '17 at 04:49
  • Then you have to load the JSON file using the HDFS API, as the file is located in HDFS. It's fine to load the data and collect it to the driver. As the file is small it will create one partition only. – mrsrinivas Jan 04 '17 at 05:15
  • Not sure if this is a Spark question. The path to the file is not correct (hdfs needs "hdfs://" which clearly is missing). Also Scala has native support for JSON parsing, so as long as the OP does everything with RDDs he should be able to get this done. – marios Jan 04 '17 at 05:22

It is much easier in Spark 2.0:

val df = spark.read.json("json/file/location/in/hdfs")
df.show()
  • It spawns a map reduce job for this. For a small JSON it is overkill, and hence I wanted to execute this with plain Scala. – baiduXiu Jan 11 '17 at 04:09

One can use the following in Spark to read the file from HDFS:

val jsonText = sc.textFile("hdfs://url/user/request.json").collect.mkString("\n")
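
Since collect brings every line back to the driver, the resulting string can be handed straight to the parser from the question, for example:

import scala.util.parsing.json.JSON

// jsonText was built above with sc.textFile(...).collect.mkString("\n")
val requestJson = JSON.parseFull(jsonText)
val mainJson = requestJson.get.asInstanceOf[Map[String, Any]].get("Request").get.asInstanceOf[Map[String, Any]]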