I'm new to big data processing and I'm reading about tools for stream processing and building data pipelines. I found Apache Spark and Spring Cloud Data Flow. What are the main differences between them, and what are the pros and cons of each? Could anybody help me?
2 Answers
They are 2 completely different tools.
Spring Cloud Data Flow (SCDF) is a toolkit for building data integration and real-time data processing pipelines. It helps you orchestrate data pipelines built from Spring Boot apps (Stream or Task). Under the hood, SCDF might use Spring Batch. Note that these Spring Boot apps can call Spark or Kafka applications to support stream processing.
Apache Spark is an engine for data processing; it is widely used for data-intensive processing and data science. It has libraries such as MLlib (machine learning), GraphX (graph processing), and Spark Streaming (which integrates with Apache Kafka), among others.
For streaming, I highly recommend studying Apache Kafka.
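To make the idea of a pipeline built from small chained apps more concrete, here is a minimal plain-Python sketch of the source, processor, sink shape. It does not use Spring Cloud Data Flow, Spring Boot, or a message broker; the names `source`, `processor`, `sink`, and `run_pipeline` are made up for illustration, and in a real SCDF stream each stage would be a separate Spring Boot app rather than a function in one process.

```python
from typing import Iterable, Iterator, List


def source(readings: Iterable[str]) -> Iterator[str]:
    """Hypothetical source stage: emits raw messages one at a time."""
    for raw in readings:
        yield raw


def processor(messages: Iterable[str]) -> Iterator[float]:
    """Hypothetical processor stage: parses messages, dropping malformed ones."""
    for msg in messages:
        try:
            yield float(msg)
        except ValueError:
            continue  # skip messages that are not numbers


def sink(values: Iterable[float], out: List[float]) -> None:
    """Hypothetical sink stage: stores results (a real sink might write to a DB)."""
    for value in values:
        out.append(value)


def run_pipeline(readings: Iterable[str]) -> List[float]:
    """Chain the three stages together, similar in spirit to 'source | processor | sink'."""
    out: List[float] = []
    sink(processor(source(readings)), out)
    return out


if __name__ == "__main__":
    print(run_pipeline(["1.5", "oops", "2"]))  # prints [1.5, 2.0]
```

Each stage only knows about its input and output, which is roughly why such stages can be developed and deployed as independent apps.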

Note that under the hood, Spring Cloud Data Flow uses Kafka or RabbitMQ for streaming. This is cool, because you can use the power of Kafka's partitioning in addition to the capabilities of Spring microservices – Ganesh Oct 18 '18 at 14:26
As mentioned in the Spring Cloud Data Flow documentation at https://dataflow.spring.io/docs/concepts/architecture/#comparison-to-other-architectures:
Comparison to Other Architectures
Spring Cloud Data Flow’s architectural style is different than other Stream and Batch processing platforms. For example in Apache Spark, Apache Flink, and Google Cloud Dataflow, applications run on a dedicated compute engine cluster. The nature of the compute engine gives these platforms a richer environment for performing complex calculations on the data as compared to Spring Cloud Data Flow, but it introduces the complexity of another execution environment that is often not needed when creating data-centric applications. That does not mean that you cannot do real-time data computations when you use Spring Cloud Data Flow. For example, you can develop applications that use the Kafka Streams API to implement time-sliding-window and moving-average functionality as well as joins of the incoming messages against sets of reference data.
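To give a feel for what a time-based sliding window with a moving average means in the quote above, here is a small plain-Python sketch. It is not the Kafka Streams API (which is a Java library) and says nothing about how Kafka Streams implements windowing; the class name `SlidingWindowAverage`, the 10-second window, and the assumption that events arrive in timestamp order are all mine, chosen for illustration.

```python
from collections import deque
from typing import Deque, Tuple


class SlidingWindowAverage:
    """Moving average over events from the last `window_seconds` seconds.

    Illustrative only: assumes events arrive in non-decreasing timestamp order.
    """

    def __init__(self, window_seconds: float) -> None:
        if window_seconds <= 0:
            raise ValueError("window_seconds must be positive")
        self.window_seconds = window_seconds
        self._events: Deque[Tuple[float, float]] = deque()  # (timestamp, value)
        self._total = 0.0

    def add(self, timestamp: float, value: float) -> float:
        """Record one event and return the average over the current window."""
        self._events.append((timestamp, value))
        self._total += value
        # Evict events that fell out of the window (timestamp - window, timestamp].
        cutoff = timestamp - self.window_seconds
        while self._events[0][0] <= cutoff:
            _, old_value = self._events.popleft()
            self._total -= old_value
        return self._total / len(self._events)


if __name__ == "__main__":
    avg = SlidingWindowAverage(window_seconds=10)
    print(avg.add(0, 10.0))   # prints 10.0
    print(avg.add(5, 20.0))   # prints 15.0
    print(avg.add(12, 30.0))  # prints 25.0 (the event at t=0 has expired)
```

A stream-processing library would also have to deal with out-of-order events, state storage, and partitioning, which this sketch deliberately ignores.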
