
I am looking for a code snippet showing the best practice for reading multiple nested JSON files under subdirectories in Hadoop using Scala.

If the above JSON files could also be written into one single file in some other directory in Hadoop, that would be even better.

Any help is appreciated.

Thanks PG

  • Are you using Spark with the Scala API, or how are you using Scala in Hadoop? – Shankar Sep 29 '16 at 06:44
  • Thanks for your response. I am using Spark with the Scala API. – user3054752 Sep 29 '16 at 10:36
  • You can use `sqlContext.read.json("json file path")` to read a JSON file; it returns a `DataFrame`. But you mentioned nested directories; do the JSON files have different schemas? – Shankar Sep 29 '16 at 14:38
  • Thanks Shankar. The files will have similar schemas, and reading the files worked. The next step: can I write all the files into one single JSON file, ideally in 1-2 steps, to be performance efficient? – user3054752 Sep 29 '16 at 20:20
  • Take a look here. I think the top answer may help: http://stackoverflow.com/questions/28203217/how-to-load-directory-of-json-files-into-apache-spark-in-python – sascha10000 Sep 29 '16 at 23:31

1 Answer


You can use `sqlContext.read.json("input file path")` to read JSON files; it returns a `DataFrame`.

Once you have the `DataFrame`, just use `df.write.json("output file path")` to write it back out as JSON.

Code example, if you use Spark 2.0:

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Spark SQL JSON example")
  .getOrCreate()

// Read the JSON files into a DataFrame
val df = spark.read.json("input/file/path")

// Write the DataFrame back out as JSON
df.write.json("output/file/path")
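
Since the question also asks about nested subdirectories and a single output file, here is a minimal sketch of one way to do that with Spark 2.0. The HDFS paths and the glob pattern are placeholder assumptions, and note that Spark always writes a directory of part files; `coalesce(1)` just ensures that directory contains a single part file, which is only advisable when the merged data fits comfortably in one partition.

import org.apache.spark.sql.SparkSession

val spark = SparkSession
  .builder()
  .appName("Merge nested JSON files")
  .getOrCreate()

// Glob pattern matching JSON files one level below the input directory;
// add more "/*" segments (or pass several paths) for deeper nesting.
val df = spark.read.json("hdfs:///data/json/*/*.json")

// coalesce(1) forces a single partition, so the output directory
// contains a single part-*.json file with all the records.
df.coalesce(1)
  .write
  .mode("overwrite")
  .json("hdfs:///data/json-merged")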
Shankar