I have a CSV file on mobile (a Swift app), and I would like to pass it to AWS Lambda for processing.

I read that API Gateway can send it as binary; however, there may be situations where the file grows to 20 MB, so this does not seem like a reasonable approach.

I did think of another solution: upload to S3, have S3 trigger a Lambda function, and process the file there.

The problem I had with this solution is that the processing must be done in real time. Uploading to S3 may take a while, and the user would be waiting for the file upload, then the Lambda processing, then the Lambda's response. I do not think this would complete in, say, 5 seconds for a file with 200k records and a complex processing algorithm.

Also, may I ask: theoretically, if the file size were below or around 10 MB, what would be the best solution?

Thanks

Edit:

I was also thinking of using CloudFront, but there was sometimes a delay between when the file was uploaded and when it became available, and I would only like the files to be available in the cache for a brief couple of minutes.

WeCanBeFriends

1 Answer


I don't think uploading directly to Lambda will work in your case, primarily due to Lambda's payload size limits.

Here's what you could do:

  • Generate a pre-signed URL to allow your client to upload the file directly to S3 (a minimal sketch follows this list)
  • Have Lambda respond to object-created events in your S3 bucket, reading the object as a stream (sketched a bit further below)
  • Lambda processes the file and uses AWS SNS to send the results to the client, or you persist the Lambda results (S3, DynamoDB, RDS, ...) and expose another API endpoint that the client requests every second to check whether the results are available
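To make the first step concrete, here is a minimal sketch of generating the pre-signed URL in Python with boto3. The bucket name, key prefix, and expiry below are assumptions for illustration, not anything from your setup:

```python
# Minimal sketch (Python 3 / boto3): a small backend handler that returns a
# pre-signed PUT URL so the mobile client can upload the CSV straight to S3.
# The bucket name "csv-uploads", the key prefix, and the 5-minute expiry are
# all hypothetical.
import json
import uuid

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    key = f"uploads/{uuid.uuid4()}.csv"
    url = s3.generate_presigned_url(
        ClientMethod="put_object",
        Params={"Bucket": "csv-uploads", "Key": key},
        ExpiresIn=300,  # seconds the URL stays valid
    )
    # The client PUTs the raw CSV bytes to `url`, then waits for the result
    return {"statusCode": 200,
            "body": json.dumps({"uploadUrl": url, "key": key})}
```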

Sending to S3 and then processing on Lambda will not add that much latency to the process. Retrieving files from S3 within Lambda should be pretty fast, especially if both are in the same region, and S3 streams are also fast (it should not take more than ~250 ms, I believe).
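For the streaming part, a rough sketch of the S3-triggered handler could look like this (again Python/boto3; the per-row work is just a placeholder):

```python
# Minimal sketch: Lambda handler fired by an S3 "object created" event.
# It streams the CSV line by line via StreamingBody.iter_lines() instead of
# buffering the whole file in memory. The row handling is a placeholder.
import csv

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    info = event["Records"][0]["s3"]
    obj = s3.get_object(Bucket=info["bucket"]["name"],
                        Key=info["object"]["key"])
    # Decode the byte stream lazily and hand it to csv.reader
    lines = (line.decode("utf-8") for line in obj["Body"].iter_lines())
    processed = 0
    for row in csv.reader(lines):
        processed += 1  # replace with the real per-record processing
    return {"rowsProcessed": processed}
```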

But the best thing is for you to deploy this solution and test the latency to see whether it's acceptable.

Renato Byrro
  • Thanks for the informative answer, Renato! I have been forced to split my Lambdas up. Would your solution work if: generate a pre-signed URL → trigger Lambda 1 → Lambda processes the file once and saves to S3 → trigger 2 → Lambda processes the file again... Once I have processed it five times, for example, then send the user an SNS notification to say it's ready? I think a Step Function would be more suitable. I did, however, like the idea of a pre-signed URL. My main worry is Lambda timing out on a dataset with 100k rows; 300 seconds is not an acceptable time for me to run a Lambda either. Thanks for the answer! – WeCanBeFriends Jan 30 '18 at 13:46
  • Also, on the payload size: what if I were to gzip it? I see that it says 6 MB for a sync response – WeCanBeFriends Jan 30 '18 at 14:14
  • @KyleGraham note that the payload size limit is for the JSON invoke request object, which is character based. Gzipped data is binary, not character, so it can't be passed over a JSON interface without base64 encoding... which only codes 6 input bits per 8 output bits... so a binary stream expands to 8/6ths size when you encode with base64. This would put your usable input size closer to 4.5M after gzipping but before base64 encoding (4.5M binary × 8 ÷ 6 = 6M character). Still usually much smaller than the original payload before gzipping but a somewhat hidden limiting factor to keep in mind. – Michael - sqlbot Jan 30 '18 at 17:20
  • About the payload 6 MB limit: well, yes, you could still upload directly to Lambda if you split/compress your files to stay under the limit. The problem is: you're already concerned about Lambda timing out and, by uploading directly to it, you'll consume its 5-minute hard execution limit with network latency. And it's unpredictable: what if your client happens to be on a very slow mobile connection? You'd then need to split your files into even smaller chunks, but how can you determine this upfront? I wouldn't go with this solution; it's certainly going to fail at some point. – Renato Byrro Jan 30 '18 at 17:41
  • About splitting into multiple Lambdas: is it 1) a different kind of processing, or 2) are you just splitting the job to prevent Lambda from timing out? If it's #1, then fine, go ahead and split it. The first Lambda could call the second Lambda directly using an AWS SDK, and this second Lambda would send the results back to the client. If it's #2, what I would do is create a "Composer" Lambda and a "Worker" Lambda. S3 would trigger the Composer, which would split the file appropriately and call the Worker several times in parallel. The Composer takes all the responses, puts them together, and sends them to the client (a minimal sketch of this pattern follows the comments). – Renato Byrro Jan 30 '18 at 17:46
  • @RenatoByrro This has definitely solved what I need. Thank you x100 Renato! – WeCanBeFriends Jan 30 '18 at 18:11
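A minimal sketch of the "Composer" fan-out discussed in the comments above (Python/boto3; the Worker function name "csv-worker" and the pre-split chunks input are assumptions):

```python
# Minimal sketch of the "Composer" Lambda from the comments: it fans the job
# out to a "Worker" Lambda in parallel and gathers the responses. The function
# name "csv-worker" and the pre-split `chunks` input are hypothetical.
import json
from concurrent.futures import ThreadPoolExecutor

import boto3

lambda_client = boto3.client("lambda")

def invoke_worker(chunk):
    resp = lambda_client.invoke(
        FunctionName="csv-worker",         # hypothetical Worker function
        InvocationType="RequestResponse",  # synchronous, so results come back
        Payload=json.dumps({"rows": chunk}).encode("utf-8"),
    )
    return json.loads(resp["Payload"].read())

def handler(event, context):
    # Assume the file has already been split into row chunks upstream
    with ThreadPoolExecutor(max_workers=10) as pool:
        results = list(pool.map(invoke_worker, event["chunks"]))
    # Compose the partial results and hand them back (e.g. via SNS or polling)
    return {"results": results}
```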