I want to create one DataFrame per file found in a directory. The JSON in each file looks like:

[{
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475662
  },
  {
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475673
  },
  {
    "a": "Need Help",
    "b": 6377,
    "c": "Member",
    "d": 721,
    "timestamp": 1590990807.475678
  }]

I could do that with the code below:

rdd = spark.sparkContext.wholeTextFiles("/content/sample_data/test_data")
files = rdd.collectAsMap()  # renamed to avoid shadowing the built-in dict
for path, content in files.items():
    df = spark.read.json(path)
    df.show()

Is there a better way to achieve this? Thanks in advance.


1 Answer


I think the creation of the first RDD is redundant; why not just iterate over the files in the directory and create a DataFrame for each one?

import glob

path = "/content/sample_data/test_data"

# the files contain JSON but use a .txt extension here
all_files = glob.glob(path + "/*.txt")

for filename in all_files:
    df = spark.read.json(filename)
    df.show()
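
If you want to keep the DataFrames around rather than just show them, you could collect them into a dict keyed by file name. A minimal sketch, assuming an existing SparkSession named spark and the same .txt extension as above:

import glob
import os

path = "/content/sample_data/test_data"

# One DataFrame per file, keyed by the file's base name
dfs = {}
for filename in glob.glob(path + "/*.txt"):
    dfs[os.path.basename(filename)] = spark.read.json(filename)

# Example: inspect each DataFrame
for name, df in dfs.items():
    print(name)
    df.show()

This way you can look up any file's DataFrame later instead of only printing it inside the loop.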