0

I have to execute a task twice per week. The task consists of fetching a 1.4 GB CSV file from a public FTP server. I then have to process it (apply some filters, discard some rows, make some calculations) and sync it to a Postgres database hosted on AWS RDS. For each row I have to retrieve the SKU entry from the database and determine whether it needs an update or not.

My question is whether EC2 could work as a solution for me. My main concern is memory. I have found libraries such as https://github.com/goodby/csv which handle this issue by fetching the file row by row instead of pulling it all into memory, but they do not work if I try to read the .csv directly from the FTP.

Can anyone provide some insight? Is AWS EC2 a good platform to solve this problem? How would you deal with the issue of the csv size and memory limitations?

Jonathan
  • 512
  • 2
  • 5
  • 17
  • `if I try to read the .csv directly from the FTP.` Please clarify, because the FTP protocol doesn't do "streaming". – mootmoot Jun 14 '16 at 09:23

4 Answers

2

You won't be able to stream the file directly from FTP; instead, you are going to have to copy the entire file and store it locally. Using the curl or ftp command is likely the most efficient way to do this.

Once you do that, you will need to write some kind of program that reads the file a line at a time, or several lines at a time if you can parallelize the work. There are ETL tools available that will make this easy. Using PHP can work, but it's not a very efficient choice for this type of work and your parallelization options are limited.
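For reference, a minimal sketch of that approach in Python (the FTP host, path and file names are placeholders, and the row-processing body is left as a comment):

# Copy the whole file locally first, then read it one row at a time so the
# 1.4 GB never has to fit in memory at once.
import csv
from ftplib import FTP

ftp = FTP('ftp.example.com')        # placeholder host
ftp.login()                         # anonymous login as an example
with open('data.csv', 'wb') as local_file:
    ftp.retrbinary('RETR /pub/data.csv', local_file.write)   # placeholder path
ftp.quit()

with open('data.csv', newline='') as f:
    for row in csv.DictReader(f):   # only one row held in memory at a time
        # apply the filters/calculations here, then look up the SKU in
        # Postgres and decide whether the row needs an update
        pass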

datasage
  • 19,153
  • 2
  • 48
  • 54
2

Of course you can do this on an EC2 instance (you can do almost anything you can supply the code for in EC2), but if you only need to run the task twice a week, the EC2 instance will be sitting idle, eating money, the rest of the time, unless you manually stop and start it for each task run.

A scheduled AWS Lambda function may be more cost-effective and appropriate here. You are slightly more limited in your code options, but you can give the Lambda function the same IAM privileges to access RDS, and it only runs when it's scheduled or invoked.
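For what it's worth, the scheduled Lambda approach would boil down to something like the sketch below; the handler body is a placeholder, and a CloudWatch Events rule with a cron/rate expression would trigger it twice a week:

def run_import():
    # placeholder for the actual fetch/filter/sync logic, which has to fit
    # within Lambda's memory and execution-time limits
    pass

def lambda_handler(event, context):
    # invoked by a CloudWatch Events scheduled rule (e.g. twice a week)
    run_import()
    return {'status': 'done'}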

Karen B
  • 2,693
  • 1
  • 17
  • 19
  • Karen, do you think that Lambda would handle the 1.4GB of information in memory? I am still studying it and I don't know if the 300-second limit might be an issue? – Jonathan Jun 14 '16 at 22:03
  • Lambda functions can be configured to use up to 1.5Gb of memory. You would be cutting that close, as your executing code will also need memory. For the execution time, you'd have to run and see. I don't know if there's a way to compute the performance of a Lambda function compared to any given instance running Linux. If you have more time than money, it's worth testing, although if your CSV file will grow over time, it may become an issue, unless AWS increases memory and execution time limits in the future. (This may be a question for their support team.) – Karen B Jun 14 '16 at 22:42
  • I did not use Lambda in the end, because of the reasons you told me, but also because I had to use two libraries and it was more complicated to do so in Lambda than installing them on an AWS EC2 Ubuntu instance. – Jonathan Jun 23 '16 at 03:41
1

The FTP protocol doesn't do "streaming". You cannot read the file from FTP chunk by chunk.

Honestly, downloading the file and spinning up a bigger instance is not a big deal if you only run the task twice a week: just choose an r3.large (it costs less than $0.20/hour), execute ASAP and stop it. The internal SSD disk space should give you the best possible I/O compared to EBS.

Just make sure your OS and code are deployed inside EBS for future reuse (unless you have an automated code deployment mechanism). And you must make sure RDS will handle the burst of I/O, otherwise it will become the bottleneck.

Even better, using an r3.large instance, you can split the CSV file into smaller chunks, load them in parallel, then shut down the instance after everything finishes. You just need to pay the minimal root EBS storage cost afterwards.

I would not suggest Lambda if the process is lengthy, since Lambda is only meant for short and fast processing (it will terminate after 300 seconds).

(update): If you open up a file, the simple way to parse it is to read it sequentially, which may not put the whole CPU to full use. You can split up the CSV file following the answer referenced here.

Then, using the same script, you can run them simultaneously by sending some to background processes; the example below shows putting Python processes in the background under Linux.

python parse_csvfile.py csv1 &
python parse_csvfile.py csv2 &
python parse_csvfile.py csv3 &

So instead of sequential I/O on a single file, it will make use of multiple files. In addition, splitting the file should be a snap on the SSD.
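If you'd rather do the splitting in Python than with a shell tool, a sketch along these lines would produce the csv1, csv2, ... chunks used above (the file names and chunk size are arbitrary choices, not values from the question):

# Split big.csv into chunks of CHUNK_ROWS rows, repeating the header in each,
# so the chunks can be parsed by separate processes in parallel.
import csv

CHUNK_ROWS = 1000000

with open('big.csv', newline='') as src:
    reader = csv.reader(src)
    header = next(reader)
    part, out, writer = 0, None, None
    for i, row in enumerate(reader):
        if i % CHUNK_ROWS == 0:        # start a new chunk file
            if out:
                out.close()
            part += 1
            out = open('csv%d' % part, 'w', newline='')
            writer = csv.writer(out)
            writer.writerow(header)
        writer.writerow(row)
    if out:
        out.close()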

mootmoot
  • 12,845
  • 5
  • 47
  • 44
  • So far I have some experience with AWS RDS and S3. I'm studying Lambda right now, so I'd like to ask you if you believe (aiming for your best guess) that my process will require more than 300 seconds? I will assume I have the csv on an S3 bucket. On the other hand, are you suggesting I could just write a Python program to download the file, process it and push it to my RDS database, test it, and then deploy it to an EC2 instance which I launch and stop per event? – Jonathan Jun 14 '16 at 22:01
  • @Jonathan : IMHO, some tasks may just not be worth the trouble and cost of investing in Lambda. You may take more time (which is costly) to test Lambda functionality and error handling. OTOH, you can deploy a timely and cheap EC2 instance with little testing and learning curve. https://serverlesscode.com/post/aws-lambda-limitations/ https://www.datawire.io/3-reasons-aws-lambda-not-ready-prime-time/ – mootmoot Jun 15 '16 at 07:42
  • I have realized that during the last two days, so I think I'll follow your advice. I decided to write some Python code using the ftplib, csv and eloquent libraries and upload it to an EC2 instance as you suggested. Just one question... can you please elaborate a bit more on loading the chunks in parallel and the burst I/O? – Jonathan Jun 16 '16 at 04:10
0

So I made it work like this.

I used Python and two great libraries. First of all I wrote Python code to request and download the CSV file from the FTP so I could load it into memory.

The first package is Pandas, a tool for analyzing large amounts of data that includes methods to read CSV files easily. I used its built-in features to filter and sort: I filtered the large CSV by a field and created about 25 new, smaller CSV files, which allowed me to deal with the memory issue.

I also used Eloquent, a library inspired by Laravel's ORM. It lets you create a connection using the AWS public DNS, database name, username and password, and make queries using simple methods, without writing a single Postgres query.

Finally I created a T2 micro AWS instance, installed Pandas and Eloquent, updated my code, and that was it.
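This is not the exact code from the answer above, just a sketch of the same idea: reading the CSV in chunks with Pandas, filtering, and writing one smaller file per value of a field. The column names and the filter condition are made up.

# Read the large CSV in chunks, discard unwanted rows, and append each group
# to its own smaller CSV file; 'price' and 'category' are invented column names.
import pandas as pd

files_started = set()
for chunk in pd.read_csv('data.csv', chunksize=100000):
    chunk = chunk[chunk['price'] > 0]               # discard unwanted rows
    for key, group in chunk.groupby('category'):    # one output file per field value
        path = '%s.csv' % key
        group.to_csv(path, mode='a', index=False,
                     header=key not in files_started)
        files_started.add(key)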

Jonathan
  • 512
  • 2
  • 5
  • 17