Overview
Currently my product maintains a DAL that is separated from business logic and exposed via a set of services, where each service generally corresponds to an entity (e.g. Car objects are accessed via the CarService). These services are backed by Spring Data repositories and access data (models) stored in both PostgreSQL and Elasticsearch.
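For concreteness, here is a simplified sketch of what one of these services looks like (CarService, CarRepository, and the findByMake query are illustrative, not our exact code; the Car JPA entity is omitted):

```java
import java.util.List;
import java.util.Optional;

import org.springframework.data.repository.CrudRepository;
import org.springframework.stereotype.Service;

// Spring Data repository over PostgreSQL; query derivation is handled by Spring.
interface CarRepository extends CrudRepository<Car, Long> {
    List<Car> findByMake(String make);
}

// The service is the only way the rest of the codebase touches Car data.
@Service
public class CarService {

    private final CarRepository carRepository;

    public CarService(CarRepository carRepository) {
        this.carRepository = carRepository;
    }

    public Optional<Car> findById(long id) {
        return carRepository.findById(id);
    }

    public List<Car> findByMake(String make) {
        return carRepository.findByMake(make);
    }
}
```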
We are now processing more and more data (documents in, our models out; or documents in, clustering, models out) and have realized that computation has become the bottleneck. To overcome this we are evaluating Spark and Apache Beam to distribute the computation horizontally.
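Roughly the shape of pipeline we have in mind, sketched here in Beam (the file paths and the buildModel step are placeholders for our actual parsing/clustering logic):

```java
import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.TextIO;
import org.apache.beam.sdk.transforms.MapElements;
import org.apache.beam.sdk.values.TypeDescriptors;

public class DocsToModels {
    public static void main(String[] args) {
        Pipeline p = Pipeline.create();

        p.apply("ReadDocs", TextIO.read().from("/data/docs/*.json"))      // documents in
         .apply("BuildModels", MapElements
                 .into(TypeDescriptors.strings())
                 .via(DocsToModels::buildModel))                          // the compute we want distributed
         .apply("WriteModels", TextIO.write().to("/data/models/part"));   // models out

        p.run().waitUntilFinish();
    }

    // Placeholder for the real document -> model transformation.
    static String buildModel(String doc) {
        return doc.toUpperCase();
    }
}
```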
Problem
After looking into the Spark (and Beam) frameworks, I have found that they generally provide their own integrations (or plugins) for reading/writing from/to datasources, which in and of itself is great. The problem for me is that I can't find any way for these frameworks to do distributed reading/writing through our current set of services. Spark requires data as an RDD and Beam as a PCollection, and I'd rather not maintain two methods of reading/writing from our datastores to accommodate them.
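To illustrate the mismatch in Spark terms: the first method below is Spark's own datasource path (the built-in JDBC reader), which bypasses the DAL entirely; the second naively reuses the service but pulls all data through the driver, so the read itself is not distributed. carService.findAll() and the connection details are placeholders, and Car would need to be Serializable:

```java
import java.util.List;

import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SparkIngestSketch {

    // What Spark provides out of the box: its own datasource integration,
    // here reading straight from PostgreSQL and bypassing CarService entirely.
    static Dataset<Row> readDirect(SparkSession spark) {
        return spark.read()
            .format("jdbc")
            .option("url", "jdbc:postgresql://localhost:5432/mydb")
            .option("dbtable", "cars")
            .option("user", "user")
            .option("password", "secret")
            .load();
    }

    // The naive way to reuse the existing DAL: fetch everything through the
    // service on the driver, then parallelize. Only the downstream compute
    // is distributed; the read does not scale with the data.
    static JavaRDD<Car> readViaService(JavaSparkContext sc, CarService carService) {
        List<Car> all = carService.findAll(); // single-node read through the DAL
        return sc.parallelize(all);
    }
}
```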
My Question
Has anyone encountered this before? What was your strategy?
Did you go ahead and support two DALs? If so, were there any caveats, especially with regard to the ongoing maintenance of the code?