I wrote a web scraper in Python that works very well on my laptop at home. After deploying it to AWS EC2, its performance deteriorates. Now I am confused about the performance of EC2 instances (even of the micro and small instances, see further details below).
Scraper in Python: Generally, the inner loop of the scraper does the following: (1) it looks up the urls on a site (20 per site, one site = one "site_break"); (2) it gets the source code of each url; (3) it extracts the necessary information into a dataframe; and (4) it saves the dataframe as pkl. After all loops it opens and merges the dataframes and saves the result as csv.
The crucial (most time-consuming) parts are: (2) download of the source codes (I/O limited by download speed), during which the program fills the RAM with the source code, and (3) processing of the source codes (CPU at 100%).
To use the RAM fully and keep similar processes together, every loop consists of site_break = 100, i.e. 100 sites * 20 urls/site = 2000 urls. This fills the RAM of my PC to 96% (see below). Since I have to wait for server responses in step 1 and step 2, I implemented threading with maxWorkers=15 (alternatively 20-35 with similar results). This implementation cuts the run time by 80%. I am sure I could gain a few more percent by implementing asyncio. Nevertheless, I want to start with the lean MVP. In the CPU-intensive step 3 I didn't implement multiprocessing (yet), because my goal was a cost-efficient/free implementation on t2.micro (with just one processor).
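To make the structure of one loop concrete, here is a minimal sketch of steps 2 and 3 as described above. It is not the actual scraper: `fetch_source` and `extract_record` are hypothetical placeholders (no real network access, no real parsing), and the example urls are made up. Only the thread pool size, site_break and the 20 urls per site are taken from the description.

```python
from concurrent.futures import ThreadPoolExecutor

SITE_BREAK = 100      # sites per loop
URLS_PER_SITE = 20    # urls found on each site
MAX_WORKERS = 15      # thread pool size for the I/O-bound download step


def fetch_source(url):
    """Hypothetical stand-in for the real download (step 2)."""
    return "<html>" + url + "</html>"


def extract_record(source):
    """Hypothetical stand-in for the CPU-bound extraction (step 3)."""
    return {"length": len(source)}


def run_loop(urls, fetch=fetch_source, max_workers=MAX_WORKERS):
    # Step 2: download with a thread pool; every source is held in RAM
    # until the whole batch has been fetched.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        sources = list(pool.map(fetch, urls))
    # Step 3: process sequentially in one process (no multiprocessing).
    return [extract_record(source) for source in sources]


# Made-up urls, only to show the batch size of one loop.
urls = ["https://example.com/site%d/item%d" % (s, u)
        for s in range(SITE_BREAK) for u in range(URLS_PER_SITE)]
records = run_loop(urls)
print(len(records))  # 2000
```

In this layout the memory footprint of a loop grows with site_break, because all 2000 sources exist in memory before step 3 starts, which is why site_break is the knob I turn on the smaller instances.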
Specifications: Home PC: Intel Core i7-6500 CPU, 2.59 GHz (2 cores, 4 logical processors), RAM 8.00 GiB, 64-bit, x64, 50 Mbit/s download rate (effectively up to 45 Mbit/s), Python 3.7.3, conda env
EC2 t2.micro: vCPUs = 1, RAM 1.0 GiB, Network Performance "low to moderate" (research in forums tells me this could be something above 50 Mbit/s), Ubuntu 18.04, Python 3.7.3, conda env
EC2 t3a.small: vCPUs = 2, RAM 2.0 GiB, Network Performance "low to moderate" but another AWS site tells me: "up to 5 Gbit/s", Ubuntu 18.04, Python 3.7.3, conda env
Since the RAM of the t2.micro is just 1 GiB, I lowered site_break from 100 to 25. The RAM still filled up, so I decreased it further from 25 to 15, 12, 10 and finally 5. For 12, 10 and especially for 5 it works pretty well: I needed 5:30 min for one loop with site_break = 100 on my PC. The t2.micro needs 8-10 sec for site_break = 5, which comes to about 3:00 min for the equivalent 100 sites, which satisfied me at first. Unfortunately, the following issue appears: after 20-30 loops the performance plummets. The time for one loop increases rapidly from 8 sec to over 2 min. My first assumption was that the low RAM is the cause, but during the small loops it doesn't seem to fill up. After stopping the script and clearing the RAM, the performance already drops after the second or third loop. If I start it again a few hours later, the first case (with the drop after 20-30 loops) repeats.
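For reference, the extrapolation behind these numbers can be written out as follows. This is only the arithmetic from the figures above, assuming the time per loop stays constant; `time_for_sites` is a helper name introduced here for illustration.

```python
def time_for_sites(seconds_per_loop, site_break, sites=100):
    """Seconds needed for `sites` sites if every loop takes the same time."""
    loops = sites / site_break
    return loops * seconds_per_loop


pc_seconds = 5 * 60 + 30                  # home PC: 5:30 min for 100 sites
micro_fast_low = time_for_sites(8, 5)     # t2.micro, good phase, 8 sec/loop
micro_fast_high = time_for_sites(10, 5)   # t2.micro, good phase, 10 sec/loop
micro_slow = time_for_sites(120, 5)       # t2.micro after the drop, 2 min/loop

print(pc_seconds)        # 330
print(micro_fast_low)    # 160.0
print(micro_fast_high)   # 200.0
print(micro_slow)        # 2400.0
```

So in its good phase the t2.micro handles 100 sites in roughly 2:40-3:20 min, compared with 5:30 min on the PC, but after the drop the same 100 sites take at least 40 min.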
Because I first thought it had to do with the RAM, I launched a second instance on t3a.small with more CPU, RAM and "up to 5 Gbit/s" network performance. I sliced the loops to site_break = 25 and started the script. It is still running at a constant speed of 1:39-1:55 min per loop, which is half as fast as the t2.micro in its best phase (10 sec for 5 => 50 sec for 25). In parallel, I started the script from my home PC with site_break = 25, and it is consistently faster at 1:15-1:30 min per loop. (Timing it manually shows the t3a.small to be 10-15 sec slower for downloading and 10-15 sec slower for processing.) This all confuses me.
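Instead of stopping the time manually, the two phases could be timed per loop with something like the sketch below. It is an assumption about how the measurement could be done, not code from the scraper: the `timed` helper and the two placeholder workloads are made up, and only `time.perf_counter` and `contextlib.contextmanager` from the standard library are used.

```python
import time
from contextlib import contextmanager

timings = {}  # phase name -> list of elapsed seconds, one entry per loop


@contextmanager
def timed(phase):
    """Record the wall-clock duration of the enclosed block under `phase`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        timings.setdefault(phase, []).append(time.perf_counter() - start)


for loop_no in range(3):
    with timed("download"):
        time.sleep(0.01)                    # placeholder for step 2
    with timed("processing"):
        sum(i * i for i in range(10000))    # placeholder for step 3

for phase, values in timings.items():
    print(phase, ["%.3f" % v for v in values])
```

Logging these values on each machine would show whether the slowdown after 20-30 loops sits in the download phase, the processing phase, or both.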
Now my questions:
- Why does the t2.micro deteriorate after several loops and why does the performance vary so wildly?
- Why is the t3a.small 50% slower than the t2.micro? I would assume that the "bigger" machine would be faster in any regard.
This leaves me stuck:
Home PC: I don't want to use it for regular (daily) scraping, since the connection drops at 4am for a short moment, which causes the script to hang. Moreover, I don't want to start the script manually, keep the PC running all the time, and block my private internet connection.
t2.micro: useless, because the performance after the deterioration is not acceptable.
t3a.small: performance is 10-20% lower than on my private PC. I would expect it to be better somehow. This makes me doubt whether to scrape on EC2 at all. Moreover, I can't understand the lower performance compared to the t2.micro at the beginning.