
I am new to both PySpark and AWS EMR. I have been given a small project where I need to scrub large amounts of data files every hour and build aggregated data sets based on them. These data files are stored on S3, and I can use some of the basic functions in Spark (like filter and map) to derive the aggregated data. To save on egress costs, and after performing a cost-benefit analysis, I decided to create an EMR cluster and make PySpark calls against it. The concept is working fine using Lambda functions triggered by files created in the S3 bucket, and I am writing the output files back to S3.
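
Roughly, the hourly job boils down to something like this (the bucket names, paths and columns below are just placeholders):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.appName("hourly-aggregation").getOrCreate()

    # read the hourly input files from S3 (placeholder bucket/prefix)
    raw = spark.read.json("s3://my-input-bucket/raw/")

    # scrub and aggregate with basic Spark operations
    cleaned = raw.filter(F.col("status") == "OK")
    agg = cleaned.groupBy("customer_id").agg(F.sum("amount").alias("total_amount"))

    # write the aggregated data set back to S3 (placeholder bucket)
    agg.write.mode("overwrite").parquet("s3://my-output-bucket/aggregated/")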

  1. But I am not able to comprehend the need for the 3-node EMR cluster I created and how it is useful to me. How can I take advantage of the Hadoop file system here, and of all the storage that is made available on the nodes?
  2. How do I view (if possible) the utilization of the slave/core nodes in the cluster? How do I know whether and how often they are used? I am executing the PySpark code on the master node.
  3. Are there alternatives to EMR that I can use with PySpark?

Is there any good documentation available to get a better understanding?

Thanks

NetRocks

1 Answer

  1. Spark is a framework for distributed computing. It can process larger-than-memory datasets and split the workload into chunks across multiple workers in parallel. By default, EMR creates 1 master node and 2 worker nodes. The disk space on the Spark nodes is typically not used directly; Spark can use it to cache temporary results.
    To use a Hadoop file system, you need to start an HDFS service in AWS. However, S3 is also distributed storage, and it is supported by the Hadoop libraries. Spark on EMR ships with the Hadoop drivers and supports S3 out of the box. Using Spark with S3 is a perfectly valid storage solution and will be good enough for a lot of basic data processing tasks.
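
    As an illustration, the same DataFrame API works against S3 and HDFS just by switching the path scheme (the bucket and paths below are placeholders):

        from pyspark.sql import SparkSession

        spark = SparkSession.builder.appName("s3-vs-hdfs").getOrCreate()

        # EMR ships the S3 Hadoop connector, so reading from S3 works out of the box
        df_s3 = spark.read.parquet("s3://my-bucket/input/")

        # the same call works against HDFS on the core nodes, if you stage data there
        df_hdfs = spark.read.parquet("hdfs:///staging/input/")

        # the local disks on the nodes mostly matter for shuffle files and caching
        df_s3.cache()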

  2. There is a Spark manager UI in AWS EMR. You can see each running Spark application session and the current job. By clicking on a job you can see how many executors are used. Whether those executors run on all nodes depends on your Spark memory and CPU configuration; tuning those is a really big topic, and there are good hints here on SO. There is also a hardware monitoring tab showing CPU and memory usage for each node. The Spark code is always executed on the master node, but it just creates a DAG plan on that node and shifts the actual work to the worker nodes according to that plan. Hence the guides speak of submitting a Spark application rather than executing it.
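
    To make the submit-vs-execute distinction concrete, here is a rough sketch (the resource numbers are examples, not recommendations):

        # Submitted from the master node; the driver only builds the DAG plan,
        # YARN then distributes the actual tasks to executors on the core nodes:
        #
        #   spark-submit --master yarn --deploy-mode cluster \
        #       --num-executors 4 --executor-cores 2 --executor-memory 4g my_job.py
        #
        # The same executor settings can also be set in code:
        from pyspark.sql import SparkSession

        spark = (SparkSession.builder
                 .appName("my_job")
                 .config("spark.executor.instances", "4")
                 .config("spark.executor.cores", "2")
                 .config("spark.executor.memory", "4g")
                 .getOrCreate())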

  3. Yes. You can start your own Spark cluster on normal EC2 instances. There is even a standalone mode, allowing you to start Spark on only one machine. It is quite a footprint that gets installed, and you still need to tune the memory, CPU and executor settings, so it is quite a lot of complexity compared to just implementing some multiprocessing in Python or using Dask. However, there are valid reasons to do so: it lets you use all cores on one machine, and it gives you a well-known, well-documented API, the same one that can be used to process petabytes of data. The linked article above explains the motivation.
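
    For example, the local/standalone route can be as small as this, using all cores of a single machine:

        from pyspark.sql import SparkSession

        # local[*] runs the driver and the executors in one JVM, using every available core
        spark = (SparkSession.builder
                 .master("local[*]")
                 .appName("single-machine")
                 .getOrCreate())

        df = spark.read.csv("data/*.csv", header=True)   # placeholder local path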

    Another possibility is to use AWS Glue, which is serverless Spark. The service submits your jobs to on-demand Spark nodes on AWS over which you have no control, similar to how Lambda functions run on arbitrary AWS EC2 instances. However, Glue has some limitations. With PySpark on Glue you cannot install Python libs with C extensions, e.g. numpy, pandas, and most ML libs. Glue also forces you to create a schema mapping of your data in the Athena catalog, whereas standalone Spark can just process the data on the fly.
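
    For reference, a minimal Glue PySpark job looks roughly like this (the database and table names refer to a hypothetical catalog entry):

        import sys
        from pyspark.context import SparkContext
        from awsglue.utils import getResolvedOptions
        from awsglue.context import GlueContext
        from awsglue.job import Job

        args = getResolvedOptions(sys.argv, ["JOB_NAME"])
        glue_context = GlueContext(SparkContext.getOrCreate())
        job = Job(glue_context)
        job.init(args["JOB_NAME"], args)

        # Glue reads through the catalog schema mapping rather than raw paths
        dyf = glue_context.create_dynamic_frame.from_catalog(database="my_db", table_name="my_table")
        df = dyf.toDF()   # convert to a plain Spark DataFrame for filter/map/aggregations

        job.commit()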

    Databricks also offers a separate serverless Spark solution outside of AWS. It is more sophisticated in my opinion, and it also allows custom C extensions.

    A big part of the official documentation focuses on the different data processing APIs and not on the internals of Apache Spark. There are some good notes on Spark internals on GitHub, and I assume every good book will cover some of the inner workings of Spark. AWS EMR is just an automated Spark cluster with a YARN orchestrator. (Unfortunately, I have never read a good book on Spark, just picked up info here and there, so I cannot recommend one.)

dre-hh