Overview
Currently my product maintains a DAL that is separated from business logic and exposed via a set of services, where each service generally corresponds to an entity (e.g. Car objects are accessed via the CarService). These services are backed by Spring Data repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
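For concreteness, here is a simplified sketch of what one of these services looks like (CarService, CarRepository, and the findByMake query are illustrative, not our exact code; the Car JPA entity is omitted):

```java
import java.util.List;
import java.util.Optional;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Service;

// Spring Data repository over PostgreSQL; query derivation is handled by Spring.
interface CarRepository extends CrudRepository<Car, Long> {
    List<Car> findByMake(String make);
}

// The service is the only way the rest of the codebase touches Car data.
@Service
public class CarService {

    private final CarRepository carRepository;

    public CarService(CarRepository carRepository) {
        this.carRepository = carRepository;
    }

    public Optional<Car> findById(long id) {
        return carRepository.findById(id);
    }

    public List<Car> findByMake(String make) {
        return carRepository.findByMake(make);
    }
}
```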
We are now processing more and more data (documents in, our models out; or documents in, clustering, models out) and have realized that computation has become the bottleneck. To overcome this we are evaluating Spark and Apache Beam to distribute the computation horizontally.
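Roughly the shape of pipeline we have in mind, sketched here in Beam (the file paths and the buildModel step are placeholders for our actual parsing/clustering logic):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DocsToModels {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        p.apply("ReadDocs", TextIO.read().from("/data/docs/*.json"))      // documents in
         .apply("BuildModels", MapElements
                 .into(TypeDescriptors.strings())
                 .via(DocsToModels::buildModel))                          // the compute we want distributed
         .apply("WriteModels", TextIO.write().to("/data/models/part"));   // models out

        p.run().waitUntilFinish();
    }

    // Placeholder for the real document -> model transformation.
    static String buildModel(String doc) {
        return doc.toUpperCase();
    }
}
```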
Problem
After looking into the Spark (and Beam) frameworks, I have found that they generally provide their own integrations (or plugins) for reading/writing from/to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to do distributed reading/writing through our current set of services. Spark requires data as an RDD and Beam as a PCollection, and I'd rather not maintain two methods of reading/writing from our datastores to accommodate them.
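To illustrate the mismatch in Spark terms: the first method below is Spark's own datasource path (the built-in JDBC reader), which bypasses the DAL entirely; the second naively reuses the service but pulls all data through the driver, so the read itself is not distributed. carService.findAll() and the connection details are placeholders, and Car would need to be Serializable:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkIngestSketch {

    // What Spark provides out of the box: its own datasource integration,
    // here reading straight from PostgreSQL and bypassing CarService entirely.
    static Dataset<Row> readDirect(SparkSession spark) {
        return spark.read()
            .format("jdbc")
            .option("url", "jdbc:postgresql://localhost:5432/mydb")
            .option("dbtable", "cars")
            .option("user", "user")
            .option("password", "secret")
            .load();
    }

    // The naive way to reuse the existing DAL: fetch everything through the
    // service on the driver, then parallelize. Only the downstream compute
    // is distributed; the read does not scale with the data.
    static JavaRDD<Car> readViaService(JavaSparkContext sc, CarService carService) {
        List<Car> all = carService.findAll(); // single-node read through the DAL
        return sc.parallelize(all);
    }
}
```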
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support two DALs? If so, were there any caveats, especially with regard to the ongoing maintenance of the code?