
Writing a file to S3 using Spark usually creates a directory with two files: a _SUCCESS marker and a data file whose name starts with part-, which holds the actual data. How do I load that data file into a pandas DataFrame, given that the path changes because the part- file name varies on each run?

For example, the write call looks like: df.coalesce(1).write.csv("s3://testfolder.csv")

The files stored in the directory are _SUCCESS and part-00-

I have a Python job that reads the file into a pandas DataFrame:

pd.read_csv("s3://..........what is the path to specify here.................")

  • If you want to create a pandas dataframe from multiple csv files, this may be helpful: https://stackoverflow.com/questions/20906474/import-multiple-csv-files-into-pandas-and-concatenate-into-one-dataframe. Reading data into pandas from S3 may require the use of `StringIO` shown here https://stackoverflow.com/questions/30818341/how-to-read-a-csv-file-from-an-s3-bucket-using-pandas-in-python. – Ankur Jul 26 '20 at 19:31
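A minimal sketch of what the comment above suggests, assuming the optional s3fs package is installed ('bucket/testfolder.csv' is a placeholder path):

import pandas as pd
import s3fs

fs = s3fs.S3FileSystem()
# Glob the part- data files that Spark wrote under the output directory
part_keys = fs.glob('bucket/testfolder.csv/part-*')

# Read each part file and concatenate into a single DataFrame
frames = [pd.read_csv(fs.open(key)) for key in part_keys]
df = pd.concat(frames, ignore_index=True)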

1 Answer


You will need the exact key to read the file, and you can find it using the boto3 module:

import boto3

s3 = boto3.resource('s3')
bucket = s3.Bucket('bucket-name')

# Collect the keys under the output prefix; the Delimiter keeps the
# _SUCCESS marker out of the listing
files = []
for obj in bucket.objects.filter(Prefix='path', Delimiter='_SUCCESS'):
    files.append(obj.key)
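The Delimiter trick works because any key containing "_SUCCESS" is rolled up into a common prefix rather than returned as an object. If you prefer something more explicit, an alternative sketch is to filter the keys in Python ('path/' is a placeholder prefix):

# Keep only the part- data files, skipping the _SUCCESS marker
files = [obj.key
         for obj in bucket.objects.filter(Prefix='path/')
         if obj.key.rsplit('/', 1)[-1].startswith('part-')]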

You can then use the snippet below to read the file as a CSV:

from io import StringIO
import pandas as pd

bucket_name = "bucket-name"
file_path = files[0]  # first part- key found above
obj = s3.Object(bucket_name, file_path)
body = obj.get()['Body'].read().decode('utf-8')
pdf = pd.read_csv(StringIO(body))
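If Spark wrote more than one part file, a sketch that reads them all and concatenates them, reusing the s3 resource and the files list from above:

frames = []
for key in files:
    body = s3.Object(bucket_name, key).get()['Body'].read().decode('utf-8')
    frames.append(pd.read_csv(StringIO(body)))

pdf = pd.concat(frames, ignore_index=True)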
Abhinav