I want to create one DataFrame per file found in a directory. The JSON in each file looks like:

[{
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475662
  },
  {
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475673
  },
  {
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475678
  }]

I could do that with the code below:

rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
files = rdd.collectAsMap()  # renamed to avoid shadowing the built-in dict
for path, content in files.items():
    df = spark.read.json(path)
    df.show()

Is there a better way to achieve this? Thanks in advance.


1 Answer


I think the creation of the first RDD is redundant; why not just iterate over the files in the directory and create a DataFrame for each one?

import glob

path = "/content/sample_data/test_data"

# the files contain JSON but use a .txt extension here
all_files = glob.glob(path + "/*.txt")

for filename in all_files:
    df = spark.read.json(filename)
    df.show()
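
If you want to keep the DataFrames around rather than just show them, you could collect them into a dict keyed by file name. A minimal sketch, assuming an existing SparkSession named spark and the same .txt extension as above:

import glob
import os

path = "/content/sample_data/test_data"

# One DataFrame per file, keyed by the file's base name
dfs = {}
for filename in glob.glob(path + "/*.txt"):
    dfs[os.path.basename(filename)] = spark.read.json(filename)

# Example: inspect each DataFrame
for name, df in dfs.items():
    print(name)
    df.show()

This way you can look up any file's DataFrame later instead of only printing it inside the loop.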