
I am a newbie in Spark.

I have many small JSON files (about 1 KB each) in subdirectories of my S3 bucket. I want to merge all the files present in a single directory. Is there an optimized way of doing this with PySpark?

Directory structure: region/year/month/day/hour/multiple_json_files

I have many directories structured like this, and I want to merge all the files within each directory.

P.S.: I have tried doing this with plain Python, but it takes too long; I also tried s3distcp with the same result.

Can anyone please help me with this?

Pavan

1 Answer


You can achieve this with the code below.

First, we need to make sure the hadoop-aws package is available when Spark is loaded:

import os

# make the hadoop-aws connector available to the driver and executors
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
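
Alternatively, once PySpark is importable (see the findspark step below), the same package can be requested through the builder instead of the environment variable; a minimal sketch, assuming no Spark session has been started yet:

from pyspark.sql import SparkSession

# request the hadoop-aws connector at session creation time
spark = (SparkSession.builder
         .config("spark.jars.packages", "org.apache.hadoop:hadoop-aws:2.7.3")
         .getOrCreate())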

Next, we need to make PySpark available in the Jupyter notebook:

import findspark
findspark.init()  # locate the local Spark installation and add it to sys.path
from pyspark.sql import SparkSession

We need the AWS credentials in order to access the S3 bucket. We can use the configparser package to read them from the standard AWS credentials file:

 import configparser

 aws_profile = "default"  # name of the profile section in ~/.aws/credentials

 config = configparser.ConfigParser()
 config.read(os.path.expanduser("~/.aws/credentials"))
 access_id = config.get(aws_profile, "aws_access_key_id")
 access_key = config.get(aws_profile, "aws_secret_access_key")
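
For reference, the credentials file that configparser reads here uses the standard INI layout, with one section per profile (placeholder values):

 [default]
 aws_access_key_id = <your-access-key-id>
 aws_secret_access_key = <your-secret-access-key>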

We can start the Spark session and pass the AWS credentials to the Hadoop configuration:

 # build the session (getOrCreate returns an existing session if one is already running)
 spark = SparkSession.builder.getOrCreate()

 sc = spark.sparkContext
 hadoop_conf = sc._jsc.hadoopConfiguration()
 hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
 hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
 hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)

Finally, we can read the data and display it:

 df = spark.read.json("s3n://path_of_location/*.json")
 df.show()
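
Reading alone does not merge anything on S3, so to address the original question: a minimal sketch (with hypothetical paths; adjust them to your bucket layout) is to read each hour directory, coalesce to a single partition, and write it back out:

 # hypothetical input/output locations for one region/year/month/day/hour directory
 input_path = "s3n://your-bucket/region/year/month/day/hour/*.json"
 output_path = "s3n://your-bucket/merged/region/year/month/day/hour"

 df = spark.read.json(input_path)

 # coalesce(1) produces a single output file per directory; this is fine for
 # many small files, but avoid it if the combined data is large
 df.coalesce(1).write.mode("overwrite").json(output_path)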
Jay Kakadiya