
Apache Apex is an open source, enterprise-grade, unified stream and batch processing platform. It is used in the GE Predix platform for IoT. What are the key differences between the two platforms, Apache Apex and Apache Spark?

Questions

  1. From a data science perspective, how is it different from Spark?
  2. Does Apache Apex provide functionality like Spark MLlib? If we have to build scalable ML models on Apache Apex, how do we do it and which language should we use?
  3. Will data scientists have to learn Java to build scalable ML models? Does it have a Python API like PySpark?
  4. Can Apache Apex be integrated with Spark, and can we use Spark MLlib on top of Apex to build ML models?
Netanel Malka
GeorgeOfTheRF

1 Answer

  1. Apache Apex is an engine for processing streaming data. Other engines with the same goal include Apache Storm and Apache Flink. The differentiating factor for Apache Apex is that it comes with built-in support for fault tolerance and scalability, plus a focus on operability, all of which are key considerations in production use cases.

     Comparing it with Spark: Apache Spark is fundamentally a batch processing engine. Spark Streaming (which uses Spark underneath) does micro-batch processing. In contrast, Apache Apex does true stream processing, in the sense that an incoming record does NOT have to wait for the next record before it is processed. Each record is processed and sent to the next stage of processing as soon as it arrives.

  2. Work is in progress on integrating Apache Apex with machine learning libraries such as Apache SAMOA and H2O. See https://issues.apache.org/jira/browse/SAMOA-49

  3. Currently, it supports Java and Scala:
    https://www.datatorrent.com/blog/blog-writing-apache-apex-application-in-scala/ For Python, you may try using Jython, but I haven't tried it myself, so I am not sure how well it works.

  4. Integrating with Spark may not be a good idea, considering that they are two different processing engines. However, Apache Apex integration with machine learning libraries is in progress.
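
To make the micro-batch versus record-at-a-time distinction from the Spark comparison above more concrete, here is a toy model in plain Python. It does not use the Apex or Spark APIs. The function names, the arrival times, and the 500 ms batch interval are all made up for illustration, and the model ignores processing time and scheduling overhead. It only shows how long a record waits before processing can start under each scheme.

```python
from typing import List


def micro_batch_waits(arrival_ms: List[int], batch_interval_ms: int) -> List[int]:
    """Simplified micro-batch model (an assumption, not Spark's actual scheduler):
    a record is held until the batch window it arrived in closes."""
    waits = []
    for t in arrival_ms:
        # End of the fixed-length window that contains arrival time t.
        batch_end = (t // batch_interval_ms + 1) * batch_interval_ms
        waits.append(batch_end - t)
    return waits


def per_record_waits(arrival_ms: List[int]) -> List[int]:
    """Simplified record-at-a-time model: each record is handed to the
    next stage as soon as it arrives, so it waits for nothing."""
    return [0 for _ in arrival_ms]


# Hypothetical arrival times in milliseconds.
arrivals = [0, 120, 450, 990, 1000]

print("micro-batch waits:", micro_batch_waits(arrivals, batch_interval_ms=500))
print("per-record waits: ", per_record_waits(arrivals))
```

In this model the micro-batch wait depends on the batch clock rather than on when the next record shows up: a record that arrives just after a window opens waits almost the full interval, while one that arrives just before the window closes waits very little.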

If you have any other questions or feature requests, you can post them on the mailing list for Apache Apex users: https://mail-archives.apache.org/mod_mbox/incubator-apex-users/

Yogi Devendra
  • Thanks! Can you explain your statement on micro-batch processing? Do you mean that in "micro-batch" the incoming record is processed only after the next record arrives, whereas in Apex records don't have to wait before being processed? – GeorgeOfTheRF Feb 24 '16 at 11:18
  • Also, as of today, running scalable ML on Apex is not an option, right? Is Apex written natively in Scala like Spark? – GeorgeOfTheRF Feb 24 '16 at 11:19
  • To answer your questions: 1) Yes, Apex processes records as they arrive; they don't have to wait. Spark, by contrast, waits for a chunk of records to arrive before processing them. 2) Currently Apex has no ML implementation. 3) Apex is natively written in Java, with support for Scala. – PradeepKumbhar Feb 25 '16 at 05:47
  • @ML_Pro http://spark.apache.org/docs/latest/streaming-programming-guide.html says the following about Spark Streaming: "Internally, it works as follows. Spark Streaming receives live input data streams and divides the data into **batches**, which are then processed by the Spark engine to generate the final stream of results in batches." – Yogi Devendra Feb 25 '16 at 08:04
  • I think you meant "In a sense that, incoming record does NOT have to wait for next record for processing." – nir Apr 19 '17 at 16:01