On the Apache Spark Streaming website, I saw this sentence:

Spark Streaming makes it easy to build scalable fault-tolerant streaming applications.

And on the Apache Flink website, there is this sentence:

Apache Flink is an open source platform for scalable batch and stream data processing.

What do "streaming application", "batch data processing", and "stream data processing" mean? Can you give some concrete examples? Are they designed for sensor data?

Ravindra S
xirururu
  • Most probably, Google will already have an answer for that. – maasg Jun 30 '15 at 10:16
  • Hi @maasg, I actually did google it, but I still cannot follow what they mean. I think sensor data should also count as streaming, but I don't understand why I need the "streaming" part. I could just analyse the data with any machine learning library. There must be either more to it than I assumed, or something totally different from what I assumed. – xirururu Jun 30 '15 at 10:23
  • Streaming data refers to unbounded streams of data. Batch data means a finite data set. If you want to continuously receive and process sensor data, you need a stream processing engine. If you have sensor data that was captured for some amount of time, you should go with a batch processing engine. – Fabian Hueske Jun 30 '15 at 10:29
  • Hi @FabianHueske, thanks very much for the answer! I still have a question: if I have a very big data set that was already collected by sensors (I don't receive any new data), is it still necessary to use Flink or Spark Streaming to analyze the data set? – xirururu Jun 30 '15 at 10:47
  • No. If your data set is of fixed size, you can (and probably should) use a batch data processor. Apache Spark and Apache Flink are both good systems for batch processing. – Fabian Hueske Jun 30 '15 at 12:14

1 Answer

Streaming data analysis (in contrast to "batch" data analysis) refers to a continuous analysis of a typically infinite stream of data items (often called events).

Characteristics of Streaming Applications

Stream data processing applications are typically characterized by the following points:

  • Streaming applications run continuously, for a very long time, and consume and process events as soon as they appear. In contrast, batch applications gather data in files or databases and process it later.

  • Streaming applications frequently concern themselves with the latency of results. The latency is the delay between the creation of an event and the point when the analysis application has taken that event into account.

  • Because streams are infinite, many computations cannot refer to the entire stream, but only to a "window" over the stream. A window is a view of a sub-sequence of the stream events (such as the last 5 minutes). An example of a real-world window statistic is the "average stock price over the past 3 days".

  • In streaming applications, the time of an event often plays a special role. Interpreting events with respect to their order in time is very common. While certain batch applications may do that as well, it is not a core concept there.
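
To make the "window" idea concrete, here is a minimal sketch in plain Python (not Spark or Flink code). It keeps a running average over the events of the last few minutes and emits an updated result for every incoming event. The function name `sliding_window_average`, the `(timestamp, value)` event shape, and the assumption that events arrive in timestamp order are all choices made for this illustration; real engines also handle out-of-order events, distribution, and fault tolerance.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple

# Hypothetical event shape for this sketch: (event_time_in_seconds, value)
Event = Tuple[float, float]


def sliding_window_average(
    events: Iterable[Event], window_seconds: float
) -> Iterator[Tuple[float, float]]:
    """Yield (event_time, average of the values seen in the last window_seconds).

    Assumes events arrive ordered by timestamp. The input may be an
    unbounded generator, because only the current window is kept in memory.
    """
    window: Deque[Event] = deque()
    total = 0.0
    for timestamp, value in events:
        window.append((timestamp, value))
        total += value
        # Evict events that are no longer inside (timestamp - window_seconds, timestamp].
        while window[0][0] <= timestamp - window_seconds:
            _, old_value = window.popleft()
            total -= old_value
        yield timestamp, total / len(window)


if __name__ == "__main__":
    readings = [(0, 10.0), (60, 20.0), (120, 30.0), (400, 50.0)]
    for event_time, average in sliding_window_average(readings, window_seconds=300):
        print(event_time, average)
```

A batch job would instead wait until all readings are stored and compute the averages in one pass afterwards. The streaming version produces a result as soon as each event arrives, which is what keeps the latency low.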

Examples of Streaming Applications

Typical examples of stream data processing applications are:

  • Fraud Detection: The application tries to figure out whether a transaction fits with the behavior that has been observed before. If it does not, the transaction may indicate an attempted misuse. This is typically a very latency-critical application.

  • Anomaly Detection: The streaming application builds a statistical model of the events it observes. Outliers indicate anomalies and may trigger alerts. Sensor data may be one source of events that one wants to analyze for anomalies.

  • Online Recommenders: If not a lot of past behavior information is available on a user that visits a web shop, it is interesting to learn from her behavior as she navigates the pages and explores articles, and to start generating some initial recommendations directly.

  • Up-to-date Data Warehousing: There are interesting articles on how to model a data warehousing infrastructure as a streaming application, where the event stream is a sequence of changes to the database, and the streaming application computes various warehouses as specialized "aggregate views" of the event stream.

  • There are many more ...
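
As a rough illustration of the anomaly detection case, the sketch below maintains a running mean and standard deviation (Welford's online algorithm) and flags readings that deviate too far from what has been seen so far. It is plain Python, not Spark or Flink code. The function name `detect_anomalies`, the 3-sigma threshold, the warm-up length, and the choice to keep flagged values out of the model are all assumptions made for this example.

```python
import math
from typing import Iterable, Iterator, Tuple


def detect_anomalies(
    values: Iterable[float], threshold: float = 3.0, warmup: int = 5
) -> Iterator[Tuple[int, float]]:
    """Yield (index, value) for readings far from the running mean.

    The model is updated one event at a time, so the input can be an
    unbounded stream. Nothing is flagged until `warmup` normal readings
    (at least 2) have been observed.
    """
    count = 0
    mean = 0.0
    m2 = 0.0  # running sum of squared deviations from the mean
    for index, value in enumerate(values):
        if count >= max(warmup, 2):
            std = math.sqrt(m2 / (count - 1))
            if std > 0 and abs(value - mean) > threshold * std:
                yield index, value
                # Design choice for this sketch: outliers do not update the model.
                continue
        # Welford's online update of mean and squared deviations.
        count += 1
        delta = value - mean
        mean += delta / count
        m2 += delta * (value - mean)


if __name__ == "__main__":
    sensor_readings = [20.0, 21.0, 19.0, 20.5, 19.5, 20.0, 55.0, 20.2]
    for index, value in detect_anomalies(sensor_readings):
        print(f"anomaly at position {index}: {value}")
```

Because the statistics are updated incrementally, the application never needs the complete data set, which is what lets it run continuously on sensor data.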

Stephan Ewen
  • Upvoted. Just a quick question: do stock market prices and other alerting systems fall under the stream processing category? They do have infinite data as per your answer, correct? – PirateApp Sep 20 '18 at 05:39