
I am able to read multiple CSV files from an S3 bucket with boto3 in Python and finally combine those files into a single pandas dataframe. However, some of the folders contain empty files, which results in the error "No columns to parse from file". Can we skip those empty files in the code below?

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('testbucket')

prefix_objs = bucket.objects.filter(Prefix="extracted/abc")

prefix_df = []

for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
    prefix_df.append(temp)

I have used this answer: https://stackoverflow.com/questions/52855221/reading-multiple-csv-files-from-s3-bucket-with-boto3
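For reference, the error in question comes from pandas itself: `read_csv` raises `pandas.errors.EmptyDataError` when the buffer it is given is empty. This can be reproduced locally, without S3 at all:

```python
import io

import pandas as pd

# An empty byte buffer reproduces the "No columns to parse from file" error
try:
    pd.read_csv(io.BytesIO(b""), header=None, encoding="utf8", sep=",")
except pd.errors.EmptyDataError as exc:
    print(exc)  # No columns to parse from file

# A non-empty buffer parses normally
df = pd.read_csv(io.BytesIO(b"1,2\n3,4\n"), header=None, encoding="utf8", sep=",")
print(df.shape)  # (2, 2)
```

Since the exception type is known, it can be caught specifically instead of using a bare `except:`.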

Kumar Gaurav

2 Answers

import io

import boto3
import pandas as pd

s3 = boto3.resource('s3')
bucket = s3.Bucket('testbucket')

prefix_objs = bucket.objects.filter(Prefix="extracted/abc")

prefix_df = []

for obj in prefix_objs:
    key = obj.key
    body = obj.get()['Body'].read()
    try:
        temp = pd.read_csv(io.BytesIO(body), header=None, encoding='utf8', sep=',')
    except pd.errors.EmptyDataError:
        # Skip empty files that pandas cannot parse
        continue
    prefix_df.append(temp)
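Since the OP ultimately wants a single dataframe, the collected frames can be combined with `pd.concat`. A minimal sketch, using synthetic frames in place of the S3 reads:

```python
import pandas as pd

# Stand-ins for the frames collected from S3 in the loop above
prefix_df = [
    pd.DataFrame([[1, 2], [3, 4]]),
    pd.DataFrame([[5, 6]]),
]

# ignore_index=True renumbers rows so the result has a clean 0..n-1 index
combined = pd.concat(prefix_df, ignore_index=True)
print(combined.shape)  # (3, 2)
```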
Wrench

I get the same error as the OP using the same code. When I execute the code below to print the names of all objects under the folder (inputfiles) inside the bucket (testbucket), I see three keys listed even though I only have two objects. The last two keys are the CSV files inside the folder, which are of interest to me, while the first key points to the folder itself that contains them.

import boto3

s3 = boto3.resource('s3')
my_bucket = s3.Bucket('testbucket')

for file in my_bucket.objects.filter(Prefix="inputfiles/"):
    print(file.key)

The reason for the error "No columns to parse from file" is that the loop is trying to parse the folder key, and the folder has no body to read. The code executes as expected when we wrap the read in a try/except block as below, which also prints the name of the key that causes the error.

for file in my_bucket.objects.filter(Prefix="inputfiles/"):
    try:
        body = file.get()['Body'].read()
        temp = pd.read_csv(io.BytesIO(body), encoding='utf8', sep=',')
        print(temp.head())  # you may print or append the data to a dataframe
    except pd.errors.EmptyDataError:
        print(file.key)  # this prints the key that has no columns to parse
        continue
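Rather than relying on the exception, the folder placeholder and genuinely empty objects can also be filtered out up front: each object summary exposes `key` and `size`, so a small predicate (hypothetical name `is_readable_csv`) can guard the loop. A sketch with simulated `(key, size)` pairs in place of the S3 listing:

```python
def is_readable_csv(key, size):
    """Skip S3 'folder' placeholder keys and zero-byte objects."""
    return not key.endswith("/") and size > 0

# Simulated (key, size) pairs as returned by bucket.objects.filter(...)
objects = [
    ("inputfiles/", 0),           # folder placeholder: skipped
    ("inputfiles/a.csv", 120),    # real file: kept
    ("inputfiles/empty.csv", 0),  # empty file: skipped
]

readable = [k for k, s in objects if is_readable_csv(k, s)]
print(readable)  # ['inputfiles/a.csv']
```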
Mostafa
Mugdhap