  • Find the CSV files in the folder
  • List all the files inside the folder
  • Convert the files to JSON and save them in the same bucket

There are many CSV files like the one below:

emp_id,Name,Company
10,Aka,TCS
11,VeI,TCS
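
With `orient='index'` (as in my code below), the sample above becomes:

```python
from io import StringIO

import pandas as pd

sample = "emp_id,Name,Company\n10,Aka,TCS\n11,VeI,TCS\n"
print(pd.read_csv(StringIO(sample)).to_json(orient='index'))
# {"0":{"emp_id":10,"Name":"Aka","Company":"TCS"},"1":{"emp_id":11,"Name":"VeI","Company":"TCS"}}
```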

My code is below:

import boto3
import pandas as pd
def lambda_handler(event, context):
    s3 = boto3.resource('s3')
    my_bucket = s3.Bucket('testfolder')
    for file in my_bucket.objects.all():
        print(file.key)
    for csv_f in file.key:
        with open(f'{csv_f.replace(".csv", ".json")}', "w") as f:
            pd.read_csv(csv_f).to_json(f, orient='index')

I am not able to save the files back to the bucket; if I remove the bucket name, they save to the local folder instead. How do I save them back to the bucket?


1 Answer


You can check the following code:

from io import StringIO

import boto3
import pandas as pd

def lambda_handler(event, context):
    s3 = boto3.resource('s3')

    input_bucket = 'bucket-with-csv-file-44244'
    my_bucket = s3.Bucket(input_bucket)

    for file in my_bucket.objects.all():
        if file.key.endswith(".csv"):
            # pandas can read the s3:// URL directly (requires s3fs/fsspec)
            csv_f = f"s3://{input_bucket}/{file.key}"
            print(csv_f)

            json_file = file.key.replace(".csv", ".json")
            print(json_file)

            # Write the JSON to an in-memory buffer, then upload it back to S3
            json_buffer = StringIO()
            df = pd.read_csv(csv_f)
            df.to_json(json_buffer, orient='index')
            s3.Object(input_bucket, json_file).put(Body=json_buffer.getvalue())

Your Lambda layer will need to have:

fsspec
pandas
s3fs
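
Alternatively, if you want to avoid the s3fs/fsspec layer dependencies entirely, here is a minimal sketch (the bucket name is a placeholder) that reads each object's body with plain boto3 and does the CSV-to-JSON conversion in memory with pandas:

```python
from io import StringIO

import pandas as pd

def csv_to_json(csv_text):
    """Convert CSV text to a JSON string keyed by row index."""
    return pd.read_csv(StringIO(csv_text)).to_json(orient='index')

def lambda_handler(event, context):
    import boto3  # provided by the Lambda runtime

    s3 = boto3.resource('s3')
    bucket_name = 'bucket-with-csv-file-44244'  # placeholder: your bucket
    for obj in s3.Bucket(bucket_name).objects.all():
        if not obj.key.endswith('.csv'):
            continue
        # Read the object body directly -- no s3fs/fsspec needed
        csv_text = obj.get()['Body'].read().decode('utf-8')
        json_key = obj.key.replace('.csv', '.json')
        s3.Object(bucket_name, json_key).put(Body=csv_to_json(csv_text))
```

With this approach only pandas needs to go into the layer, since boto3 is already available in the Lambda environment.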
  • Can I ask what fsspec and s3fs are used for? – aysh Aug 11 '20 at 05:08
  • @aysh To read from s3. Pandas can read directly from s3. It should be able to write as well, but in my tests now, I didn't write. – Marcin Aug 11 '20 at 05:10
  • One last question: why do we need to use StringIO? Sorry if I am disturbing you. I didn't get the reason for the lines json_buffer = StringIO() and put(Body=json_buffer.getvalue()). – aysh Aug 11 '20 at 05:19
  • @aysh This is a workaround. Normally pandas should be able to write to s3, but in my tests it did not. Maybe you will have more luck. The alternative and more traditional way of writing to s3 is [here](https://stackoverflow.com/a/40615630/248823), which involves StringIO. – Marcin Aug 11 '20 at 05:22
  • If I want to save df.to_json(json_buffer, orient='index') to a different bucket, can I pass a parameter here, or do I need to create a function like upload_file() where I pass the bucket name? – aysh Aug 11 '20 at 08:00
  • @aysh In `s3.Object(input_bucket, json_file)` you can change `input_bucket` to something else. – Marcin Aug 11 '20 at 08:02
  • Can you answer this one: https://stackoverflow.com/questions/63618932/how-to-read-content-from-the-s3-bucket-as-url – aysh Aug 27 '20 at 15:07