To perform source data preparation, data transformation or data cleansing, in what scenario should we use Dataprep vs Dataflow vs Dataproc?

Ryan Yuan
  • Linking: https://stackoverflow.com/questions/56329619/what-are-the-differences-between-cloud-dataflow-and-dataprep – blong Aug 08 '19 at 18:20

3 Answers

Data preparation/transformation/cleaning tasks can all be seen as ETL processes, implementable with any of the products you mention. This older answer covers the basics of the Dataflow vs Dataproc question and includes this link which summarises what you should keep in mind when choosing between these three.

In brief, you should consider familiarity (have you already worked with Hadoop-ecosystem tools? The Beam programming model? Would you rather work via a UI?) and the desired level of control (Dataproc allows more control over the cluster, while Dataflow and Dataprep are fully managed services).
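To make the trade-off concrete, here is a rough sketch that encodes the two criteria above (familiarity and level of control) as a small Python function. The function name `suggest_service`, its parameters and the order in which the checks run are all assumptions made for illustration; this is not an official decision procedure, just one way to read the advice.

```python
def suggest_service(knows_hadoop=False, knows_beam=False,
                    prefers_ui=False, needs_cluster_control=False):
    """Hypothetical helper: map the criteria from the answer to a product name.

    The priority order below is an assumption, not an official rule.
    """
    # Control over the cluster, or existing Hadoop/Spark skills, points to Dataproc.
    if needs_cluster_control or knows_hadoop:
        return "Dataproc"
    # A preference for working through a UI points to Dataprep.
    if prefers_ui:
        return "Dataprep"
    # Familiarity with the Beam programming model points to Dataflow.
    if knows_beam:
        return "Dataflow"
    # No strong signal: either fully managed service is a reasonable start.
    return "Dataflow or Dataprep"


print(suggest_service(knows_hadoop=True))
print(suggest_service(prefers_ui=True))
```

In practice the criteria overlap (you may know both Spark and Beam), so treat the result as a starting point rather than a verdict.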

More good reads:

Adam Ocsvari
Lefteris S

Both Dataproc and Dataflow are data processing services on Google Cloud. Both can process batch or streaming data, and both offer workflow templates that make them easier to use. The features that distinguish the two are described below.

Dataproc is designed to run on clusters, which makes it compatible with Apache Hadoop, Hive and Spark. It creates clusters quickly and can autoscale them without interrupting running jobs.

Dataflow is a better fit if your pipeline has no existing Spark or Hadoop implementation. You do not manage a cluster yourself; instead, the service is based on parallel data processing: data is split and processed on multiple workers to reduce processing time.
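The "split the data and process the pieces in parallel" idea can be sketched locally with nothing but the Python standard library. This is only an analogy and does not use the Dataflow or Apache Beam APIs; the names `clean_record`, `clean_chunk` and `process_in_parallel`, the chunk size and the cleaning rule (trim, lowercase, drop empty values) are assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def clean_record(record):
    """Illustrative cleaning rule: trim whitespace and lowercase."""
    return record.strip().lower()


def clean_chunk(chunk):
    """Clean one chunk of records and drop those that end up empty."""
    cleaned = (clean_record(r) for r in chunk)
    return [r for r in cleaned if r]


def process_in_parallel(records, workers=4, chunk_size=2):
    """Split records into chunks and clean the chunks on several workers."""
    chunks = [records[i:i + chunk_size]
              for i in range(0, len(records), chunk_size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = pool.map(clean_chunk, chunks)
        return [r for chunk in results for r in chunk]


print(process_in_parallel(["  Foo ", "", "BAR", " baz"]))
```

A managed service does the equivalent of the chunking and worker scheduling for you, which is the main practical difference from running this kind of code on a cluster you administer yourself.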

ama

An important note: Dataprep provides data cleaning and automatically identifies anomalies in the data. It is integrated with Cloud Storage, Bigtable and BigQuery.