
I am still not clear about the architecture even after going through the tutorials. How do we scale StreamSets in a distributed environment? Say the input data velocity from the origin increases; how do we ensure that SDC doesn't run into performance issues? How many daemons will be running? Will it be a master-worker architecture or a peer-to-peer architecture?

If there are multiple daemons running on multiple machines (e.g. one SDC alongside each NodeManager in YARN), how will it show a centralized view of the data, i.e. total record count etc.?

Also, please let me know the architecture of Dataflow Performance Manager. Which daemons run in that product?

Aman Raturi
  • Can you clarify a bit more about the concern/question about daemons, and what you mean there? Are you talking about [daemon threads](https://docs.oracle.com/javase/8/docs/api/java/lang/Thread.html#isDaemon--) specifically? If so, do you have particular concerns regarding daemon threads? In Java, they behave almost identically to normal threads w.r.t. resource consumption, etc., which is why I'm wondering. – Jeff Evans Dec 08 '17 at 20:54

1 Answer


StreamSets Data Collector (SDC) scales by partitioning the input data. In some cases this can be done automatically: for example, Cluster Batch mode runs SDC as a MapReduce job on the Hadoop/MapR cluster to read Hadoop FS / MapR FS data, while Cluster Streaming mode leverages Kafka partitions and executes SDC as a Spark Streaming application, running as many pipeline instances as there are Kafka partitions.
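
To make the partition-to-pipeline mapping concrete, here is a minimal conceptual sketch in plain Java. It is not SDC's internal code; the topic name `events`, the broker address, and the partition count are made-up assumptions. It just runs one consumer-plus-processing loop per Kafka partition, which is why throughput grows with the number of partitions:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

// Conceptual sketch only: one worker per Kafka partition, each running its own
// copy of the pipeline logic. Illustrates the scaling model, not SDC internals.
public class PartitionedPipelineSketch {

    public static void main(String[] args) {
        int partitionCount = 4;  // assumed: topic "events" has 4 partitions
        ExecutorService pool = Executors.newFixedThreadPool(partitionCount);

        for (int p = 0; p < partitionCount; p++) {
            final int partition = p;
            pool.submit(() -> runPipelineInstance(partition));
        }
    }

    static void runPipelineInstance(int partition) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");  // placeholder broker
        props.put("group.id", "pipeline-sketch");
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Pin this "pipeline instance" to exactly one partition.
            consumer.assign(Collections.singletonList(new TopicPartition("events", partition)));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                records.forEach(r -> process(r.value()));  // stand-in for the pipeline's stages
            }
        }
    }

    static void process(String record) {
        // transformation / destination logic would go here
    }
}
```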

In other cases, StreamSets can scale by multithreading - for example, the HTTP Server and JDBC Multitable Consumer origins run multiple pipeline instances in separate threads.
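
Conceptually, a multithreaded origin behaves like a shared work queue with N worker threads, each running its own copy of the pipeline. The sketch below is an illustration only (the table names and thread count are invented, and this is not the SDC API); it mirrors the idea behind the JDBC Multitable Consumer keeping several threads busy across a set of tables:

```java
import java.util.List;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Conceptual sketch of a multithreaded origin (not SDC's actual API):
// a shared queue of tables and N threads that each run a pipeline copy
// over whichever table they pull next.
public class MultithreadedOriginSketch {

    public static void main(String[] args) {
        // Assumed table names for illustration.
        List<String> tables = List.of("orders", "customers", "shipments", "invoices");
        int numThreads = 3;  // analogous to the origin's thread-count setting

        BlockingQueue<String> workQueue = new LinkedBlockingQueue<>(tables);
        ExecutorService pool = Executors.newFixedThreadPool(numThreads);

        for (int i = 0; i < numThreads; i++) {
            pool.submit(() -> {
                String table;
                // Each thread keeps taking tables until the queue is empty.
                while ((table = workQueue.poll()) != null) {
                    runPipelineInstance(table);
                }
            });
        }
        pool.shutdown();
    }

    static void runPipelineInstance(String table) {
        // Stand-in for reading a batch from the table and pushing it
        // through the pipeline's processors and destinations.
        System.out.println(Thread.currentThread().getName() + " processing " + table);
    }
}
```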

In all cases, Dataflow Performance Manager (DPM) can give you a centralized view of the data, including total record count.

metadaddy
  • So does it mean SDC needs an external Hadoop or Spark cluster, i.e. an SDC instance running separately and launching jobs on the cluster? For example, one of our use cases is receiving 1000 files a day from different upstream systems (via scp) in parallel, and they are relatively large, let's say 1-10GB each. We have to apply some transformations on all those files, and later we do some joining and aggregation (keeping that task out of SDC). Does it require a big machine with lots of cores and memory on a single node? Is it possible to set up a cluster of SDC instances the way NiFi does? – uday Apr 04 '18 at 15:01
  • 2
    @Uday, Streamsets advantages lies in **streaming data** and not in hard core ETL tool. Alan Shalloway compares his car to an umbrella in his book **Design Patterns Explained** we use both to stay dry in the rain, but the umbrella has an advantage of being light and fold-able, but the car has wheels and can protect more than one person. Ofcourse Streamsets can do some of ETL work for you, but for big files, its good to use Apache spark processor. – Ash May 21 '18 at 23:31
  • 1
    @Uday, Finally, to answer your question, yes you can copy the files using **whole file** data format option. – Ash May 22 '18 at 00:26