Questions tagged [apache-flink]

Apache Flink

Apache Flink is an open-source platform for scalable batch and stream data processing. It supports batch and streaming analytics in one system, and analytical programs can be written in concise, elegant APIs in Java and Scala. The batch WordCount below illustrates the Scala API:

import org.apache.flink.api.scala._

case class WordWithCount(word: String, count: Int)

// `path` and `outputPath` are placeholders for the input and output locations.
val env = ExecutionEnvironment.getExecutionEnvironment
val text = env.readTextFile(path)

val counts = text.flatMap { _.split("\\W+") }
  .map { WordWithCount(_, 1) }
  .groupBy("word")
  .sum("count")

counts.writeAsCsv(outputPath)
env.execute("WordCount")
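
Since the same system also runs streaming programs, a streaming counterpart may help. The sketch below is adapted from the classic socket WordCount, assuming the DataStream Scala API; the host and port are placeholder values.

import org.apache.flink.streaming.api.scala._

case class WordWithCount(word: String, count: Int)

val env = StreamExecutionEnvironment.getExecutionEnvironment

// Read lines from a socket (host/port are placeholders) and emit a
// rolling per-word count that updates as new elements arrive.
val counts = env.socketTextStream("localhost", 9999)
  .flatMap { _.split("\\W+") }
  .filter { _.nonEmpty }
  .map { WordWithCount(_, 1) }
  .keyBy("word")
  .sum("count")

counts.print()
env.execute("Streaming WordCount")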

These are some of the unique features of Flink:

  • Hybrid batch/streaming runtime that supports batch processing and data streaming programs.
  • Custom memory management to guarantee efficient, adaptive, and highly robust switching between in-memory and out-of-core data processing algorithms.
  • Flexible and expressive windowing semantics for data stream programs (see the sketch after this list).
  • Built-in program optimizer that chooses the proper runtime operations for each program.
  • Custom type analysis and serialization stack for high performance.
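
As a concrete illustration of the windowing bullet, here is a minimal sketch of common window choices on a keyed stream. It assumes the DataStream Scala API of the same era as the Scala 2.10 artifacts referenced below; WordWithCount mirrors the example above, and windowVariants is a hypothetical helper name.

import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.time.Time

case class WordWithCount(word: String, count: Int)

// Hypothetical helper: the same keyed stream with different window types.
def windowVariants(words: DataStream[WordWithCount]): Unit = {
  val keyed = words.keyBy("word")

  // Tumbling processing-time window: one aggregate per key per minute.
  keyed.timeWindow(Time.minutes(1)).sum("count")

  // Sliding windows: 1 minute long, evaluated every 10 seconds.
  keyed.timeWindow(Time.minutes(1), Time.seconds(10)).sum("count")

  // Count window: fires once 100 elements have arrived for a key.
  keyed.countWindow(100).sum("count")
}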

Learn more about Flink at http://flink.apache.org.

Building Apache Flink from Source

Prerequisites for building Flink:

  • Unix-like environment (we use Linux, Mac OS X, and Cygwin)
  • git
  • Maven (at least version 3.0.4)
  • Java 6, 7, or 8 (note that Oracle's JDK 6 library will fail to build Flink, but it can run a pre-compiled package without problems)

Commands:

git clone https://github.com/apache/flink.git
cd flink
mvn clean package -DskipTests

Flink is now installed in the build-target directory.
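
To sanity-check the build, the distribution scripts in build-target can start and stop a local cluster (this assumes the standard Flink distribution layout; the web frontend then listens on http://localhost:8081):

./build-target/bin/start-cluster.sh
./build-target/bin/stop-cluster.sh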

Developing Flink

The Flink committers use IntelliJ IDEA and Eclipse IDE to develop the Flink codebase.

Minimal requirements for an IDE are:

  • Support for Java and Scala (also mixed projects)
  • Support for Maven with Java and Scala

IntelliJ IDEA

The IntelliJ IDE supports Maven out of the box and offers a plugin for Scala development.

Check out our Setting up IntelliJ guide for details.

Eclipse Scala IDE

For Eclipse users, we recommend using Scala IDE 3.0.3, based on Eclipse Kepler. While this is a slightly older version, we found it to be the version that works most robustly for a complex project like Flink.

Further details, and a guide to newer Scala IDE versions, can be found in the How to setup Eclipse docs.

Note: Before following this setup, make sure to run the build from the command line once (mvn clean package -DskipTests; see above).

  1. Download the Scala IDE (preferred) or install the plugin into Eclipse Kepler. See How to setup Eclipse for download links and instructions.
  2. Add the "macro paradise" compiler plugin to the Scala compiler: open "Window" -> "Preferences" -> "Scala" -> "Compiler" -> "Advanced" and put the path to the macro paradise jar file into the "Xplugin" field (typically "/home/-your-user-/.m2/repository/org/scalamacros/paradise_2.10.4/2.0.1/paradise_2.10.4-2.0.1.jar"). Note: if you do not have the jar file, you probably did not run the command-line build.
  3. Import the Flink Maven projects ("File" -> "Import" -> "Maven" -> "Existing Maven Projects").
  4. During the import, Eclipse will ask whether to automatically install additional Maven build helper plugins.
  5. Close the "flink-java8" project. Since Eclipse Kepler does not support Java 8, you cannot develop this project.

Support

Don’t hesitate to ask!

Contact the developers and community on the mailing lists if you need any help.

Open an issue if you find a bug in Flink.

Documentation

The documentation of Apache Flink is located on the website: http://flink.apache.org or in the docs/ directory of the source code.

Fork and Contribute

This is an active open-source project. We are always open to people who want to use the system or contribute to it. Contact us if you are looking for implementation tasks that fit your skills. The How to Contribute guide on the Flink website describes how to contribute to Apache Flink.

About

Apache Flink is an open source project of The Apache Software Foundation (ASF). The Apache Flink project originated from the Stratosphere research project.

7452 questions

163 votes, 4 answers
What is/are the main difference(s) between Flink and Storm?
Flink has been compared to Spark, which, as I see it, is the wrong comparison because it compares a windowed event-processing system against micro-batching; similarly, it does not make that much sense to me to compare Flink to Samza. In both cases…
fnl

115 votes, 3 answers
What are the benefits of Apache Beam over Spark/Flink for batch processing?
Apache Beam supports multiple runner backends, including Apache Spark and Flink. I'm familiar with Spark/Flink and I'm trying to see the pros/cons of Beam for batch processing. Looking at the Beam word count example, it feels it is very similar to…
bluenote10

35 votes, 4 answers
Difference between exactly-once and at-least-once guarantees
I'm studying distributed systems and referring to this old question: stackoverflow link. I really can't understand the difference between exactly-once, at-least-once, and at-most-once guarantees; I read about these concepts in Kafka, Flink, and Storm and…
Akinn

30 votes, 4 answers
could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[...]
I am trying to write some use cases for Apache Flink. One error I run into pretty often is "could not find implicit value for evidence parameter of type org.apache.flink.api.common.typeinfo.TypeInformation[SomeType]". My problem is that I can't really…
jheyd

27 votes, 2 answers
Apache Flink vs Apache Spark as platforms for large-scale machine learning?
Could anyone compare Flink and Spark as platforms for machine learning? Which is potentially better for iterative algorithms? Link to the general Flink vs Spark discussion: What is the difference between Apache Spark and Apache Flink?
Alexander

26 votes, 3 answers
What is the difference between mini-batch vs real time streaming in practice (not theory)?
What is the difference between mini-batch and real-time streaming in practice (not theory)? In theory, I understand mini-batch is something that batches in a given time frame, whereas real-time streaming is more like doing something as the data arrives…

22 votes, 3 answers
Apache Flink - Difference between Checkpoints & Save points?
Can someone please help me understand the difference between Apache Flink's Checkpoints & Savepoints? While I read the documentation, I couldn't understand the difference! :s
Raja

21 votes, 2 answers
Some puzzles for the operator Parallelism in Flink
I just got the example below for parallelism and have some related questions: Is setParallelism(5) setting the parallelism to 5 just for sum, or for both flatMap and sum? Is it possible to set different parallelism for different…
YuFeng Shen

20 votes, 3 answers
Flink webui when running from IDE
I am trying to see my job in the web UI. I use createLocalEnvironmentWithWebUI; the code runs well in the IDE, but it is impossible to see my job at http://localhost:8081/#/overview. val conf: Configuration = new Configuration() import…
GermainGum

20 votes, 1 answer
How to implement HTTP sink correctly?
I want to send calculation results of my DataStream flow to another service over the HTTP protocol. I see two possible ways to implement it: use a synchronous Apache HttpClient client in the sink public class SyncHttpSink extends…
Maxim

19 votes, 1 answer
How to output one data stream to different outputs depending on the data?
In Apache Flink I have a stream of tuples. Let's assume a really simple Tuple1. The tuple can have an arbitrary value in its value field (e.g. 'P1', 'P2', etc.). The set of possible values is finite, but I don't know the full set beforehand…
Jan Thomä

18 votes, 1 answer
Combine two streams in Apache Flink regardless of window time
I have two data streams that I want to combine. The problem is that one data stream has a much higher frequency than the other, and there are times when one stream receives no events at all. Is it possible to use the last event from the one…
FLoppix

17 votes, 5 answers
Kafka Client Timeout of 60000ms expired before the position for partition could be determined
I'm trying to connect Flink to a Kafka consumer. I'm using Docker Compose to build 4 containers: zookeeper, kafka, Flink JobManager, and Flink TaskManager. For zookeeper and Kafka I'm using wurstmeister images, and for Flink I'm using the official…
Mahmoud Sultan

15 votes, 2 answers
Apache Flink: java.lang.NoClassDefFoundError
I'm trying to follow this example, but when I try to compile it, I get this error: Error: Unable to initialize main class com.amazonaws.services.kinesisanalytics.aws Caused by: java.lang.NoClassDefFoundError:…

15 votes, 5 answers
Could not resolve substitution to a value: ${akka.stream.materializer} in AWS Lambda
I have a Java application in which I'm using the Flink API. Basically, what I'm trying to do is create two DataSets with a few records and then register them as two tables along with the necessary fields. DataSet comp =…
Kulasangar