
I want to implement an AWS Lambda function that executes the following Python script:

import os
import json
import pandas as pd

directory = os.fsencode(directory_in_string)

def transform_csv(csv):

    for file in os.listdir(directory):
        filename = os.fsdecode(file)

        d = open(r'C:\Users\r.reibold\Documents\GitHub\groovy_dynamodb_api\historische_wetterdaten\{}'.format(filename))

        data = json.load(d)

        df_historical = pd.json_normalize(data)

        # Transform the Unix timestamp to a datetime
        df_historical["dt"] = pd.to_datetime(df_historical["dt"], unit='s', errors='coerce').dt.strftime("%m/%d/%Y %H:%M:%S")

        df_historical["dt"] = pd.to_datetime(df_historical["dt"])

.
.
.
.
  

My question now is:

How do I have to change the os commands, since I need to reference the S3 bucket and not my local directory?

My first attempt looks like this:

import json
import boto3
import pandas as pd

DIRECTORY = 's3://weatherdata-templates/historische_wetterdaten/New/'
BUCKET = 'weatherdata-templates'

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=BUCKET, Prefix=DIRECTORY)

def lambda_handler(event, context):

    for page in pages:
        for obj in page['Contents']:

            filename = s3.fsdecode(obj)

            d = open(r's3://102135091842-weatherdata-templates/historische_wetterdaten/New/{}'.format(filename))

            data = json.load(d)

            df_historical = pd.json_normalize(data)
.
.
.

Am I on the right track or completely wrong? Thanks.

    Download the file to local & then open it. `open` can't read objects in s3 – rdas Nov 08 '21 at 15:03
  • Ok but is there a way to do that without downloading it to local? – Hector Devough Nov 08 '21 at 15:21
  • You can't read the contents of a file in s3 without downloading it. – rdas Nov 08 '21 at 15:23
  • You might consider using the [smart_open](https://github.com/RaRe-Technologies/smart_open) Python package. It does some of the work for you of streaming objects from S3, so you just have a file object you can use in some places instead of an object returned from `open`. – Anon Coward Nov 08 '21 at 17:27
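For illustration, a minimal, untested sketch of the two approaches suggested in the comments above. The bucket name and prefix are taken from the question; the file name and the /tmp path are placeholders, and smart_open is a third-party package that would have to be bundled with the Lambda deployment:

import json
import boto3

s3 = boto3.client('s3')

# Option 1: download the object into Lambda's writable /tmp directory, then open it locally.
# 'example.json' and the /tmp path are placeholders.
s3.download_file('weatherdata-templates', 'historische_wetterdaten/New/example.json', '/tmp/example.json')
with open('/tmp/example.json') as f:
    data = json.load(f)

# Option 2: stream the object with smart_open (third-party); it accepts s3:// URLs
# and returns a file-like object, so no local copy is needed.
from smart_open import open as s3_open
with s3_open('s3://weatherdata-templates/historische_wetterdaten/New/example.json') as f:
    data = json.load(f)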

1 Answer


Not quite there yet :)

Unfortunately, you can't call open(...) on an S3 URL: open() expects a path on the local filesystem, and an S3 object is not a local file.

To load the object's contents without storing the file locally, use the boto3 S3 resource, which provides a higher-level interface on top of the S3 client:

  1. Get the key of the object from obj['Key'].
  2. Create an s3.Object from the bucket name and that key, then use .get()['Body'] to get the contents as a StreamingBody.
  3. Call .read() on the StreamingBody to get the object as bytes, then decode it to a UTF-8 string (or whatever encoding your files are in).
  4. Convert the JSON string to a Python object using json.loads(...).
import json
import boto3
import pandas as pd

s3_resource = boto3.resource('s3')
...
def lambda_handler(event, context):
    for page in pages:
        for obj in page['Contents']:
            # Build an Object resource from the bucket name and the listed key
            obj_reference = s3_resource.Object(BUCKET, obj['Key'])
            # Read the StreamingBody into bytes and decode to a UTF-8 string
            body = obj_reference.get()['Body'].read().decode('utf-8')
            data = json.loads(body)
            df_historical = pd.json_normalize(data)
            ...
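One detail worth double-checking against the paginator setup in the question: list_objects_v2 expects Prefix to be a key prefix, not a full s3:// URL. A minimal sketch of that setup, assuming the bucket and prefix from the question:

import boto3

BUCKET = 'weatherdata-templates'
PREFIX = 'historische_wetterdaten/New/'  # key prefix only, no 's3://bucket/' part

s3 = boto3.client('s3')
paginator = s3.get_paginator('list_objects_v2')
pages = paginator.paginate(Bucket=BUCKET, Prefix=PREFIX)

def lambda_handler(event, context):
    for page in pages:
        # 'Contents' is missing from a page with no matching keys
        for obj in page.get('Contents', []):
            ...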
  • Ok, this looks like what I'm looking for. I will try that. Another question: I read that one has to encode the body as well, so I would put a `.decode('utf-8')` behind the `.read()`. What do you think? – Hector Devough Nov 08 '21 at 16:06
  • You can't decode the `StreamingBody` AFAIK - do you have a reference for that? I don't even think that will compile/run – Ermiya Eskandary Nov 08 '21 at 16:10
  • Here in the doc it says that the metadata is encoded: https://docs.aws.amazon.com/AmazonS3/latest/userguide/object-keys.html I looked it up on stackoverflow and found this where they use a similar solution like you suggested but they use this decoding part: https://stackoverflow.com/questions/40995251/reading-an-json-file-from-s3-using-python-boto3/47121263 – Hector Devough Nov 08 '21 at 16:13
  • I can't see anyone decoding the actual `StreamingBody` - try the above and let me know if you have any issues. The file name may be encoded but just try and report back – Ermiya Eskandary Nov 08 '21 at 16:15
  • It seems like it works. But now I'm running into a RAM problem, I think. At the end of the Python script it transfers data from a JSON file to DynamoDB. There are 2 .json files in the bucket with 68MB each. The only error I get is this one: `OpenBLAS WARNING - could not determine the L2 cache size on this system, assuming 256k` Is there a way to adjust the RAM? Sorry if this is a stupid question. I'm relatively new to AWS, Stacks, YAML files etc. – Hector Devough Nov 09 '21 at 13:51
  • I selected your answer as helpful but I can't upvote because I have too little reputation. :/ But thank you very much @Ermiya Eskandary. I will open a new question. :) – Hector Devough Nov 09 '21 at 14:00
  • @HectorDevough Feel free to add a link to your new question here too when done :) – Ermiya Eskandary Nov 09 '21 at 14:03
  • here we go: https://stackoverflow.com/questions/69899946/aws-s3-to-dyanmodb-openblas-warning-could-not-determine-the-l2-cache-size-on-t – Hector Devough Nov 09 '21 at 14:31