I'm trying to understand what would be a good framework that integrates easily with existing Python code and allows distributing a huge dataset across multiple worker nodes to perform some transformation or operation on it.
The expectation is that each worker node is assigned data based on a specific key (here, country, as in the transaction data below), performs the required transformation, and returns its results to the leader node.
Finally, the leader node should perform an aggregation of the results obtained from the worker nodes and return one final result.
transactions = [
{'name': 'A', 'amount': 100, 'country': 'C1'},
{'name': 'B', 'amount': 200, 'country': 'C2'},
{'name': 'C', 'amount': 10, 'country': 'C1'},
{'name': 'D', 'amount': 500, 'country': 'C2'},
{'name': 'E', 'amount': 400, 'country': 'C3'},
]
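
For context, this is roughly the single-process version of what I want to distribute: group by country, run a transformation per group (summing the amount here is just a stand-in for the real logic), and then aggregate the per-group results on the leader side.

from collections import defaultdict

def transform(country, rows):
    # placeholder per-group transformation; the real logic is more involved
    return country, sum(r['amount'] for r in rows)

# group rows by the key (country)
groups = defaultdict(list)
for t in transactions:
    groups[t['country']].append(t)

# transform each group, then aggregate the results into one final dict
per_country = [transform(c, rows) for c, rows in groups.items()]
final_result = dict(per_country)
print(final_result)  # {'C1': 110, 'C2': 700, 'C3': 400}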
I came across a similar question where Ray is suggested as an option, but does Ray allow specifying which worker gets the data based on a key?
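
The only Ray approach I can think of is creating one actor per country and routing rows to it myself, roughly as sketched below (CountryWorker and the sum are just placeholders for my real transformation), but I'm not sure whether this is idiomatic or whether Ray has proper key-based partitioning built in.

import ray

ray.init()

@ray.remote
class CountryWorker:
    # one actor per country, holding that country's rows
    def __init__(self):
        self.rows = []

    def add(self, row):
        self.rows.append(row)

    def total(self):
        # placeholder transformation for this country's partition
        return sum(r['amount'] for r in self.rows)

# manually route each row to the actor for its country
workers = {}
for t in transactions:
    if t['country'] not in workers:
        workers[t['country']] = CountryWorker.remote()
    workers[t['country']].add.remote(t)

# leader-side aggregation of the per-country results
result = {c: ray.get(w.total.remote()) for c, w in workers.items()}
print(result)  # {'C1': 110, 'C2': 700, 'C3': 400}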
Another question talks about using PySpark for this, but then how do you make the existing Python code work with PySpark with minimal code changes, given that PySpark has its own APIs?
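
From what I can tell, the PySpark route means re-expressing the transformation in its DataFrame API, roughly like this (again, the sum is only a placeholder), which is exactly the rewrite of existing code I'm hoping to avoid.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transactions").getOrCreate()

# build a DataFrame from the list of dicts; Spark infers the schema
df = spark.createDataFrame(transactions)

# the per-key transformation has to be rewritten with DataFrame operations
rows = (df.groupBy('country')
          .agg(F.sum('amount').alias('total'))
          .collect())

print({row['country']: row['total'] for row in rows})  # {'C1': 110, 'C2': 700, 'C3': 400}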