I'm trying to understand what would be a good framework that integrates easily with existing Python code and allows distributing a huge dataset across multiple worker nodes to perform some transformation or operation on it.
The expectation is that each worker node is assigned data based on a specific key (here, country, as in the transaction data below), performs the required transformation, and returns its results to the leader node.
Finally, the leader node should perform an aggregation of the results obtained from the worker nodes and return one final result.
transactions = [
{'name': 'A', 'amount': 100, 'country': 'C1'},
{'name': 'B', 'amount': 200, 'country': 'C2'},
{'name': 'C', 'amount': 10, 'country': 'C1'},
{'name': 'D', 'amount': 500, 'country': 'C2'},
{'name': 'E', 'amount': 400, 'country': 'C3'},
]
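
For context, this is roughly the single-process version of what I want to distribute: group by country, run a transformation per group (summing the amount here is just a stand-in for the real logic), and then aggregate the per-group results on the leader side.

from collections import defaultdict

def transform(country, rows):
    # placeholder per-group transformation; the real logic is more involved
    return country, sum(r['amount'] for r in rows)

# group rows by the key (country)
groups = defaultdict(list)
for t in transactions:
    groups[t['country']].append(t)

# transform each group, then aggregate the results into one final dict
per_country = [transform(c, rows) for c, rows in groups.items()]
final_result = dict(per_country)
print(final_result)  # {'C1': 110, 'C2': 700, 'C3': 400}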
I came across a similar question where Ray is suggested as an option, but does Ray allow specifying which worker gets the data based on a key?
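
The only Ray approach I can think of is creating one actor per country and routing rows to it myself, roughly as sketched below (CountryWorker and the sum are just placeholders for my real transformation), but I'm not sure whether this is idiomatic or whether Ray has proper key-based partitioning built in.

import ray

ray.init()

@ray.remote
class CountryWorker:
    # one actor per country, holding that country's rows
    def __init__(self):
        self.rows = []

    def add(self, row):
        self.rows.append(row)

    def total(self):
        # placeholder transformation for this country's partition
        return sum(r['amount'] for r in self.rows)

# manually route each row to the actor for its country
workers = {}
for t in transactions:
    if t['country'] not in workers:
        workers[t['country']] = CountryWorker.remote()
    workers[t['country']].add.remote(t)

# leader-side aggregation of the per-country results
result = {c: ray.get(w.total.remote()) for c, w in workers.items()}
print(result)  # {'C1': 110, 'C2': 700, 'C3': 400}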
Another question talks about using PySpark for this, but then how do you make the existing Python code work with PySpark with minimal code changes, given that PySpark has its own APIs?
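
From what I can tell, the PySpark route means re-expressing the transformation in its DataFrame API, roughly like this (again, the sum is only a placeholder), which is exactly the rewrite of existing code I'm hoping to avoid.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("transactions").getOrCreate()

# build a DataFrame from the list of dicts; Spark infers the schema
df = spark.createDataFrame(transactions)

# the per-key transformation has to be rewritten with DataFrame operations
rows = (df.groupBy('country')
          .agg(F.sum('amount').alias('total'))
          .collect())

print({row['country']: row['total'] for row in rows})  # {'C1': 110, 'C2': 700, 'C3': 400}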