I have a DataFrame, and I want to write its records to a Kafka topic. Currently I am doing it like this:

// Perform an avg/groupBy operation on the DataFrame and return the result DataFrame
DataFrame outputDF = dataOperationService.performOperation(dataFrame);
List<Row> rows = outputDF.collectAsList();
KafkaProducer<String, String> producer = initKafkaProducer();
try {
    for (Row row : rows) {
        String deptName = row.getString(0);
        double salary = row.getDouble(1);
        producer.send(new ProducerRecord<>(topicName, deptName + "," + salary)).get();
    }
} catch (Exception e) {
    // Report the cause instead of silently discarding it
    System.out.println("Something went wrong: " + e);
} finally {
    producer.close();
}

Since I am collecting the DataFrame as a List, all the rows are brought into the driver JVM's memory, which will have a serious performance impact if there are millions of rows. So I feel this is not the correct way to do it, but I am not sure what the right way is. I have heard of something called a broadcast variable; is that the way to do it? How can I fit it into my example?
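
For reference, a broadcast variable ships a read-only value from the driver to the executors, so it does not by itself move output rows anywhere. One commonly suggested alternative is to send records from each partition on the executors, with one producer per partition, so that no full List is ever built on the driver. Below is a minimal sketch of that per-partition loop only, not a definitive implementation. RecordSink, DeptSalary, ListSink and writePartition are hypothetical names introduced for illustration: they stand in for KafkaProducer and Row so the sketch runs without Spark or a broker.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

class PartitionWriteSketch {

    // Hypothetical stand-in for KafkaProducer, so the sketch runs without a broker.
    interface RecordSink {
        void send(String value);
        void close();
    }

    // Hypothetical stand-in for a Spark Row holding (deptName, salary).
    static final class DeptSalary {
        final String deptName;
        final double salary;

        DeptSalary(String deptName, double salary) {
            this.deptName = deptName;
            this.salary = salary;
        }
    }

    // In-memory sink used only for the demo in main().
    static final class ListSink implements RecordSink {
        final List<String> sent = new ArrayList<>();
        boolean closed = false;

        @Override
        public void send(String value) { sent.add(value); }

        @Override
        public void close() { closed = true; }
    }

    // Work done once per partition: stream rows through the iterator one at a
    // time, send each as "deptName,salary", and always close the sink.
    // Returns the number of records sent.
    static int writePartition(Iterator<DeptSalary> rows, RecordSink sink) {
        int count = 0;
        try {
            while (rows.hasNext()) {
                DeptSalary row = rows.next();
                sink.send(row.deptName + "," + row.salary);
                count++;
            }
        } finally {
            sink.close();
        }
        return count;
    }

    public static void main(String[] args) {
        // Assumed sample data standing in for one partition of the result.
        List<DeptSalary> partition = Arrays.asList(
                new DeptSalary("engineering", 1200.5),
                new DeptSalary("sales", 900.0));
        ListSink sink = new ListSink();
        int written = writePartition(partition.iterator(), sink);
        System.out.println(written + " records sent: " + sink.sent);
    }
}
```

In a Spark job, a loop like this would presumably run inside outputDF.javaRDD().foreachPartition(...), with the producer created inside that function rather than on the driver, because KafkaProducer is not serializable. Whether one producer per partition is acceptable depends on the partition count and the cluster, so treat this as a starting point.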

eatSleepCode
