
I wrote a web scraper in Python that works very well on my laptop at home. After deploying it to AWS EC2, the performance of the scraper deteriorates. Now I am confused about the performance of EC2 instances (even of the micro and small instances, see further details below).

Scraper in Python: Generally, the inner loop of the scraper does the following: (1) it looks up URLs on a site (20 per site, one site = one "site_break"). In a second step, it (2) gets the source code of each URL; in a third step, it (3) extracts the necessary information into a dataframe; and in the fourth step, it (4) saves the dataframe as pkl. After all loops it opens and merges the dataframes and saves them as csv.
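For illustration, the inner loop described above might look roughly like this (a sketch with my own placeholder names, not the OP's actual code; `lookup_urls`, `download`, and `extract` stand in for the site-specific logic):

```python
import pickle

def chunked(sites, site_break):
    """Split the full site list into batches of `site_break` sites per loop."""
    return [sites[i:i + site_break] for i in range(0, len(sites), site_break)]

def scrape_loop(batch, out_path):
    rows = []
    for site in batch:
        for url in lookup_urls(site):    # (1) look up ~20 URLs per site
            html = download(url)         # (2) fetch the source code (I/O bound)
            rows.append(extract(html))   # (3) parse into records (CPU bound)
    with open(out_path, "wb") as f:      # (4) persist the batch as pkl
        pickle.dump(rows, f)

# after all loops: load every pkl, concatenate, and write a single csv
```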

The crucial (most time-consuming) parts are: (2) downloading the source code (I/O bound, limited by download speed; the program fills the RAM with source code) and (3) processing the source code (CPU at 100%).

To use the RAM fully and group similar work together, every loop consists of site_break = 100, i.e. 100 sites * 20 urls/site = 2000 urls. This fills the RAM of my PC to 96% (see below). Since I have to wait for server responses in step 1 and step 2, I implemented threading with max_workers=15 (alternatively 20-35 with similar results). This implementation cuts the run time by 80%. I am sure I could shave off a few more percent by implementing asyncio. Nevertheless, I want to start with the lean MVP. In the processor-consuming step 3 I didn't implement multiprocessing (yet), because my goal was a cost-efficient/free implementation on t2.micro (with just one processor).
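The threaded download step could be sketched like this (a minimal example, assuming blocking HTTP calls; `download` here is a stand-in for the real fetch function, e.g. one built on `requests.get`):

```python
from concurrent.futures import ThreadPoolExecutor

def download(url):
    # placeholder for a blocking HTTP GET, e.g. requests.get(url).text
    return f"<html>{url}</html>"

def download_all(urls, max_workers=15):
    # threads overlap the waiting time on server responses in steps (1) and (2)
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(download, urls))
```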

Specification: Home PC: Intel Core i7-6500 CPU, 2.59 GHz (2 cores, 4 logical processors), RAM 8.00 GiB, 64-bit, x64, 50 Mbit/s download rate (effectively up to 45 Mbit/s), Python 3.7.3, conda env

EC2 t2.micro: vCPUs = 1, RAM 1.0 GiB, Network Performance "low to moderate" (research in forums tells me this could be something above 50 Mbit/s), Ubuntu 18.04, Python 3.7.3, conda env

EC2 t3a.small: vCPUs = 2, RAM 2.0 GiB, Network Performance "low to moderate" (but another AWS page says "up to 5 Gbit/s"), Ubuntu 18.04, Python 3.7.3, conda env

Since the RAM of the t2.micro is just 1 GiB, I lowered site_break from 100 to 25. Afterwards, the RAM still filled up, so I decreased it in further steps from 25 to 15, 12, 10 and finally 5. For 12, 10 and especially for 5 it works pretty well: I needed 5:30 min for one loop with site_break = 100 on my PC. The t2.micro needs 8-10 sec for site_break = 5, which comes to 3:00 min for the analogous 100 sites, which satisfied me at first. Unfortunately, the following issue appears: after 20-30 loops the performance plummets. The time for one loop increases rapidly from 8 sec to over 2 min. My first assumption was the low RAM, but during the small loops it doesn't seem to fill up. After stopping the script and clearing the RAM, the performance drops again after the second or third loop. If I start it a few hours later, the first case (with the drop after 20-30 loops) repeats.

Because I first thought it had to do with the RAM, I launched a second instance on t3a.small with more CPU, more RAM and "up to 5 Gbit/s" network performance. I lowered the loops to site_break = 25 and started the script. It is still running at a constant speed of 1:39-1:55 min per loop (which is half as fast as the t2.micro in its best phase: 10 sec for 5 => 50 sec for 25). In parallel, I started the script from my home PC with site_break = 25, and it is consistently faster with 1:15-1:30 min per loop. (Timing it manually shows the instance is 10-15 sec slower for downloading and 10-15 sec slower for processing.) All this confuses me.

Now my questions:

  1. Why does the t2.micro deteriorate after several loops, and why does the performance vary so wildly?
  2. Why is the t3a.small 50% slower than the t2.micro? I would assume that the "bigger" machine would be faster in every regard.

This leaves me stuck:

  1. I don't want to use my home PC for regular (daily) scraping, since my connection drops for a short period at 4 am, which makes the script hang. Moreover, I don't want to run the script manually, keep the PC on all the time, and block my private internet connection.

  2. t2.micro: useless, because the performance after the deterioration is not acceptable.

  3. t3a.small: performance is 10-20% lower than my private PC. I would expect it to be better somehow. This makes me doubt whether to scrape on EC2 at all. Moreover, I can't understand the lower performance compared to the t2.micro at the beginning.

MaGarb
  • Micro and small machines are *way* more limited than the typical home PC, because their typical use case is small to moderate web-server use, which doesn't need a whole lot of horsepower. Their CPU time is also shared and throttled among many instances, and you'll see your performance drop considerably if you keep using 100% CPU for a long-ish time. – deceze Jun 24 '19 at 10:06
  • Try some experimentation with different instance types (especially different instance families). You can use Spot Instances to reduce costs, so you could try some really big instances to see whether they work out better on a "cost per worker" basis. – John Rotenstein Jun 24 '19 at 10:59
  • To add to what @deceze said, "T" instances actually give you a fraction of a CPU [see docs](https://aws.amazon.com/ec2/instance-types/#Burstable_Performance_Instances). They're only appropriate for workloads that need intermittent CPU, not something that runs at 100% continuously. – kdgregory Jun 24 '19 at 11:22
  • As for the "t3a": the "a" means that it's using an AMD processor. You can probably find benchmarks comparing AMD to Intel, although any such benchmarks might not match your workload. – kdgregory Jun 24 '19 at 11:27
  • Thanks for the answers: since my script runs for several hours, I am not sure if Spot Instances are appropriate. What if I start a job on an on-demand instance, and while the job is running the spot price for a booked instance drops and the spot instance is launched: does the running script move? Don't I have to set up the instance again, copying the env and modules? What if a job starts on a Spot Instance which is terminated before it finishes? What happens to the interim results/data on the instance? – MaGarb Jun 24 '19 at 11:43
  • I think you need much more data about what the machines are doing, e.g. get something like `vmstat 2` running and see what the machines are actually doing, i.e. are you I/O- or CPU-bound, and at what points. Note that RAM doesn't ever really "get full" and you probably want a better metric for this (e.g. the number of pages being swapped to/from disk per second). – Sam Mason Jun 24 '19 at 12:40

2 Answers


After some testing I reached a satisfying solution:

  1. The switch from t3a.small to t3.small accounts for a performance improvement of about 40%, making it even 10-20% faster than my home PC. Thanks to @kdgregory
  2. I improved my code by using asyncio in combination with multiprocessing instead of multithreading. This not only speeds it up even more, but also leads to better memory utilisation. Moreover, I got rid of the tags which still existed after using bs4.BeautifulSoup (Python high memory usage with BeautifulSoup). Through this last improvement I could prevent the memory from growing while running.
  3. After a long run, the reported free memory was lower than before. I figured out that I had made a beginner's mistake: Linux uses spare RAM for disk caching, so low "free" memory is normal (https://www.linuxatemyram.com/). Fortunately, this is not a problem.

Now the code is even faster than on my home PC and runs automatically.

MaGarb
  1. Why does the t2.micro deteriorate after several loops and why does the performance vary so wildly?

If your RAM is not getting full, then this is most likely because Amazon is limiting the resources your instance is consuming, whether that is CPU or I/O. T2 instances earn CPU credits at a fixed rate and spend them when bursting; Amazon will give you more compute and throughput for a while (to accommodate any short-term spikes), but once the credit balance is exhausted you are throttled, and you should not mistake the burst performance for baseline performance.
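Using AWS's published figures for the t2.micro (it earns 6 CPU credits per hour, one credit equals one vCPU at 100% for one minute, and the balance is capped at 144 credits), a back-of-the-envelope calculation illustrates the burst model; the actual slowdown point also depends on the starting balance and on network throttling:

```python
# Back-of-the-envelope burst budget for a t2.micro (figures from AWS docs)
earn_per_hour = 6     # CPU credits earned per hour
burn_per_hour = 60    # credits spent per hour at 100% on one vCPU
max_balance = 144     # maximum accrued credit balance

net_burn = burn_per_hour - earn_per_hour   # 54 credits drained per hour
burst_hours = max_balance / net_burn       # ~2.67 hours of 100% CPU from a full balance
baseline = earn_per_hour / burn_per_hour   # 0.1 -> throttled to 10% of one vCPU

print(round(burst_hours, 2), baseline)
```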

  2. Why is the t3a.small 50% slower than the t2.micro? I would assume that the "bigger" machine would be faster in every regard.

T3 instances are designed for applications with moderate CPU usage that experience temporary spikes in use. With T3 you are either paying a premium to afford larger and more frequent spikes, or you are getting less baseline performance (for the same price) in exchange for those spikes. This does not match the web-scraping profile, where you want constant CPU and I/O.

Ioannis Tsiokos