
I'm a noob to AWS and Lambda, so I apologize if this is a dumb question. What I would like to be able to do is load a spreadsheet into an S3 bucket, trigger Lambda based on that upload, have Lambda load the CSV into pandas and do stuff with it, then write the dataframe back out as a CSV to a second S3 bucket.

I've read a lot about zipping a Python script with all the libraries and dependencies and uploading that, and that's a separate question. I've also figured out how to trigger Lambda upon uploading a file to an S3 bucket and how to automatically copy that file to a second S3 bucket.

The part I'm having trouble finding any information on is the middle part: loading the file into pandas and manipulating it, all inside the Lambda function.

First question: is something like that even possible? Second question: how do I "grab" the file from the S3 bucket and load it into pandas? Would it be something like this?

import pandas as pd
import boto3
import json
s3 = boto3.resource('s3')

def handler(event, context):
    dest_bucket = s3.Bucket('my-destination-bucket')
    df = pd.read_csv(event['Records'][0]['s3']['object']['key'])
    # stuff to do with dataframe goes here

    s3.Object(dest_bucket.name, <code for file key>).copy_from(CopySource = df)

I really have no idea if that's even close to right; it's a complete shot in the dark. Any and all help would be really appreciated, because I'm pretty obviously out of my element!

Tkelly
  • It should be possible; see the following question for how to read a file from S3 into pandas: https://stackoverflow.com/questions/37703634/how-to-import-a-text-file-on-aws-s3-into-pandas-without-writing-to-disk – Usman Azhar Jan 16 '18 at 22:44
  • Thanks for your response. It seems that answer is more about accessing a file in an S3 bucket; Lambda isn't used at all, and it appears to just be a normal Python script. How would I modify it to do that within an AWS Lambda function, as per my question? – Tkelly Jan 16 '18 at 22:55
  • You can use that Python code within your `handler` method or write a separate method; the post explains the steps to do it. In your case you need to put that inside the Lambda function, and since you have already configured the Lambda trigger it should work. – Usman Azhar Jan 16 '18 at 22:59
  • It looks like you are passing the S3 object key to the pandas read_csv() method. An S3 key is of the form dir1/dir2/file.csv. What you need is the S3 URI for the object, and that's of the form s3://bucket/dir1/dir2/file.csv. So, construct the proper URI from the bucket and key in the event object and then pass it to pandas read_csv(). – jarmod Jan 17 '18 at 00:49
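
Following jarmod's comment, here is a minimal sketch of what such a handler could look like. It assumes the s3fs package is bundled in the deployment package alongside pandas, since that is what lets read_csv understand s3:// URLs; the handler name and return value are just placeholders:

import pandas as pd

def handler(event, context):
    # Bucket and key of the object that triggered the invocation
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Build the full S3 URI and let pandas fetch the object itself (requires s3fs)
    df = pd.read_csv('s3://{}/{}'.format(bucket, key))

    # stuff to do with dataframe goes here
    return len(df)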

1 Answer


This code is for a Lambda function that is triggered when an object is PUT into a bucket; it then GETs the object and PUTs it back to S3 under another key (or bucket):

import boto3
from urllib.parse import unquote_plus

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Bucket and key of the object that triggered the event; the key arrives URL-encoded
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        response = s3.get_object(Bucket=bucket, Key=key)
        # Placeholder destination key; the original, larger script set this elsewhere
        end_path = 'copy-of-' + key
        s3_upload_article(response['Body'].read(), bucket, end_path)
        return response['ContentType']
    except Exception as e:
        print(e)
        print('Error getting object {} from bucket {}. Make sure they exist and your bucket is in the same region as this function.'.format(key, bucket))
        raise e

def s3_upload_article(html, bucket, end_path):
    s3.put_object(Body=html, Bucket=bucket, Key=end_path, ContentType='text/html', ACL='public-read')

I broke this code out from a more complicated Lambda script I have written; however, I hope it displays some of what you need to do. The PUT of the object only triggers the script. Any other actions that occur after the event is triggered are up to you to code into the script.

bucket = event['Records'][0]['s3']['bucket']['name']
key = unquote_plus(event['Records'][0]['s3']['object']['key'])

Bucket and key in the first few lines are the bucket and key of the object that triggered the event. Everything else is up to you.
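
To connect this back to the original question, here is a rough sketch (not the exact script above) of pulling the triggering CSV into pandas using only boto3 and an in-memory buffer; the processing step and return value are placeholders for your own logic:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')

def lambda_handler(event, context):
    # Same bucket/key extraction as above
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = event['Records'][0]['s3']['object']['key']

    # Read the object body straight into pandas, no temp file on disk
    obj = s3.get_object(Bucket=bucket, Key=key)
    df = pd.read_csv(io.BytesIO(obj['Body'].read()))

    # stuff to do with dataframe goes here
    return {'rows': len(df)}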

  • Thanks for your response. I'm still having trouble. I can successfully load my csv into pandas and manipulate it, but I'm really struggling with how to then take my dataframe, turn it back into a csv, and then put that file into a new bucket. Could you lend any clarity on how I can accomplish that? – Tkelly Jan 17 '18 at 22:11
  • @Tkelly I have never used pandas before, but it appears there is a `pandas.DataFrame.to_csv` function that may accomplish this? https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.to_csv.html Are you having trouble on that step or the PUT? – Nicholas Martinez Jan 17 '18 at 22:38
  • @Tkelly Did you check this post? https://stackoverflow.com/questions/38154040/save-dataframe-to-csv-directly-to-s3-python It explains the steps to write the dataframe directly to an S3 bucket. – Usman Azhar Jan 17 '18 at 23:41
  • @UsmanAzhar I hadn't found that question before, thanks for pointing it out. I think that might be exactly what I need. – Tkelly Jan 18 '18 at 15:18
  • @NicholasMartinez the `pd.to_csv()` method doesn't work on its own as S3 wants the data in a different format. I think @UsmanAzhar pointed me in the right direction (a sketch of that approach follows below). I'm gonna give it a try and find out! Thanks again to both of you for the assistance. – Tkelly Jan 18 '18 at 15:21
  • @Tkelly can you please share how you loaded the csv into a dataframe using pandas (in the lambda function)? Helpful links will also do. I'm struggling with numpy dependency issues even though it is present. – T3J45 Oct 15 '18 at 15:12
  • What does the quote function do? And how do we import it? Can we test this event in our local environment? As I am building the Lambda as a Docker image with pandas, I want to test it right there. – Mohseen Mulla Apr 20 '22 at 11:28
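
For the write-back step discussed in the comments above (the approach from the linked question about saving a dataframe to S3), a rough sketch: serialize the dataframe to CSV in an in-memory buffer with to_csv, then upload that text with put_object. The helper name, destination bucket, and output key are placeholders:

import io
import boto3
import pandas as pd

s3 = boto3.client('s3')

def write_df_to_s3(df, bucket, key):
    # to_csv accepts any file-like object, so an in-memory buffer avoids temp files
    buffer = io.StringIO()
    df.to_csv(buffer, index=False)
    s3.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())

# e.g. after manipulating df inside the handler:
# write_df_to_s3(df, 'my-destination-bucket', 'processed-' + key)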