
I am stuck with the following problem. I have around 30,000 JSON files stored in a single S3 bucket. Each file is very small, only 400-500 KB, but there are a lot of them.

I want to create a DataFrame from all of these files. I am reading the JSON files using a wildcard, as follows:

var df = sqlContext.read.json("s3n://path_to_bucket/*.json")

I also tried this approach since json(...) is deprecated:

var df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")

The problem is that creating df takes a very long time. I waited 4 hours and the Spark job was still running.

Is there a more efficient way to collect all of these JSON files and create a DataFrame from them?

UPDATE:

Or, at least, is it possible to read only the last 1000 files instead of all of them? I found out that one can pass options via sqlContext.read.format("json").options, but I cannot figure out how to read only the N newest files.

Dinosaurius

2 Answers


If you can get the names of the last 1000 modified files into a simple list, you can simply call:

sqlContext.read.json(filePathsList: _*)

Please note that the .option(...) calls are usually used to configure schema and parsing options.
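For example, if you already know the structure of your JSON, passing an explicit schema means Spark does not have to infer one by scanning the files first. A minimal sketch (the field names here are made up, and sqlContext is assumed to exist as in your snippets):

import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Hypothetical schema -- replace these fields with whatever your JSON files actually contain.
val jsonSchema = StructType(Seq(
  StructField("id", LongType, nullable = true),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema, the reader can skip the schema-inference pass over all the files.
val df = sqlContext.read
  .format("json")
  .schema(jsonSchema)
  .load("s3n://path_to_bucket/*.json")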

Unfortunately, I haven't used S3 before, but I think you can use the same logic as in the answer to this question to get the names of the last modified files: How do I find the last modified file in a directory in Java?
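For S3 specifically, a rough sketch of the same idea using the AWS SDK for Java (v1) from Scala could look like the following; the SDK is assumed to be on the classpath, the bucket name is a placeholder, and error handling is omitted:

import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, S3ObjectSummary}
import scala.collection.JavaConverters._

val bucket = "path_to_bucket"                       // placeholder bucket name
val s3 = AmazonS3ClientBuilder.defaultClient()

// List all objects in the bucket, following pagination (S3 returns at most 1000 keys per call).
var request = new ListObjectsV2Request().withBucketName(bucket)
var summaries = List.empty[S3ObjectSummary]
var result = s3.listObjectsV2(request)
summaries ++= result.getObjectSummaries.asScala
while (result.isTruncated) {
  request = request.withContinuationToken(result.getNextContinuationToken)
  result = s3.listObjectsV2(request)
  summaries ++= result.getObjectSummaries.asScala
}

// Keep the 1000 most recently modified .json keys and turn them into full paths.
val filePathsList = summaries
  .filter(_.getKey.endsWith(".json"))
  .sortBy(_.getLastModified.getTime)(Ordering[Long].reverse)
  .take(1000)
  .map(s => s"s3n://$bucket/${s.getKey}")

val df = sqlContext.read.json(filePathsList: _*)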

Mousa

You are loading roughly 13 GB of data (30,000 files of ~450 KB each). Are you sure it is only the creation of the DataFrame that takes so long? Maybe the rest of the application is running and the UI just makes it look that way.

Try just loading the DataFrame and printing its first row.
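For example (a small sketch reusing the path from the question), timing a minimal action separates the cost of the read itself from the rest of the job:

// Time the read plus a minimal action, so the cost of creating the DataFrame
// can be separated from whatever the rest of the application does.
val start = System.nanoTime()
val df = sqlContext.read.format("json").load("s3n://path_to_bucket/*.json")
df.first()   // forces Spark to actually read (at least some of) the data
println(s"Read took ${(System.nanoTime() - start) / 1e9} seconds")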

Anyway, what is the configuration of the cluster?