You can achieve this with the code below.
First, we need to make sure the hadoop-aws package is available when we load Spark:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = "--packages=org.apache.hadoop:hadoop-aws:2.7.3 pyspark-shell"
Next, we need to make PySpark importable from the Jupyter notebook:
import findspark
findspark.init()
from pyspark.sql import SparkSession
We need the AWS credentials in order to access the S3 bucket. We can use the configparser package to read them from the standard AWS credentials file:
import configparser
config = configparser.ConfigParser()
config.read(os.path.expanduser("~/.aws/credentials"))
aws_profile = "default"  # name of the profile section in ~/.aws/credentials; adjust if you use a named profile
access_id = config.get(aws_profile, "aws_access_key_id")
access_key = config.get(aws_profile, "aws_secret_access_key")
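This assumes the credentials file uses the standard INI layout, with the profile name (default here) as the section header, for example:
[default]
aws_access_key_id = YOUR_ACCESS_KEY_ID
aws_secret_access_key = YOUR_SECRET_ACCESS_KEY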
We can now start the Spark session and pass the AWS credentials to the Hadoop configuration:
spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext
hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3n.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoop_conf.set("fs.s3n.awsAccessKeyId", access_id)
hadoop_conf.set("fs.s3n.awsSecretAccessKey", access_key)
Finally we can read the data and display it:
df = spark.read.json("s3n://path_of_location/*.json")
df.show()
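If you just want to verify that the read worked without pulling rows to the driver, printing the inferred schema is a cheap check:
df.printSchema()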