
The scenario is:

  • Hundreds of thousands of IoT devices send small messages every second to the main Gateway of our system (infrastructure monitoring)
  • We stream these messages into Kafka (into 20 different topics), partitioning the data by IoT sensor ID through a configurable partitioning formula
  • Kafka Connect processes read the messages from these 20 topics and write them to Hadoop HDFS, aggregating the messages every minute and partitioning the data into different HDFS staging directories (basically by groups of devices)

We would like to import all this data into Impala efficiently, also optimizing the Parquet file size for faster queries.
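To make the setup concrete, here is a simplified sketch of the kind of table layout involved, written as Impala SQL. The table names (staging_events, current_day, events_history), the column names (device_id, ts, metric, value) and the device_group partition column are placeholders, not our real schema; the sketch also assumes the Kafka Connect sink writes Parquet (adjust STORED AS otherwise):

    -- External table over the Kafka Connect staging directories
    -- (one sub-directory per device group, files rolled every minute).
    CREATE EXTERNAL TABLE staging_events (
      device_id STRING,
      ts        TIMESTAMP,
      metric    STRING,
      value     DOUBLE
    )
    PARTITIONED BY (device_group STRING)
    STORED AS PARQUET
    LOCATION '/data/staging/events';

    -- Intra-day landing table (the CURRENT_DAY repository).
    CREATE TABLE current_day (
      device_id STRING,
      ts        TIMESTAMP,
      metric    STRING,
      value     DOUBLE
    )
    PARTITIONED BY (device_group STRING)
    STORED AS PARQUET;

    -- Long-term table that most queries should hit, partitioned by day so the
    -- daily compaction can rewrite one date partition as a few large files.
    CREATE TABLE events_history (
      device_id STRING,
      ts        TIMESTAMP,
      metric    STRING,
      value     DOUBLE
    )
    PARTITIONED BY (event_date STRING, device_group STRING)
    STORED AS PARQUET;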

For now, we have two processes:

  • First process: every 20 minutes, run code that further compacts all the files into a CURRENT_DAY repository and then loads the data into Impala
  • Second process: every day, run Impala SQL code that compacts the data accumulated in CURRENT_DAY and then truncates CURRENT_DAY to free space before new data arrives (a rough SQL sketch of both processes follows below)
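A rough Impala SQL sketch of what the two processes do, using the placeholder staging_events / current_day / events_history tables from the layout sketch above (not our exact code):

    -- First process (every ~20 minutes): make newly landed staging files
    -- visible and move them into the CURRENT_DAY table.
    ALTER TABLE staging_events RECOVER PARTITIONS;  -- pick up any new device_group directories
    REFRESH staging_events;                         -- pick up new files in existing partitions

    INSERT INTO current_day PARTITION (device_group)
    SELECT device_id, ts, metric, value, device_group
    FROM staging_events;
    -- (assumes the loaded staging files are then moved or deleted, so the next
    --  run does not insert them again)

    -- Second process (daily): compact CURRENT_DAY into the history table,
    -- then truncate it before the new day's data arrives.
    SET PARQUET_FILE_SIZE=268435456;                -- target ~256 MB Parquet files
    INSERT INTO events_history PARTITION (event_date, device_group)
    SELECT device_id, ts, metric, value,
           to_date(ts) AS event_date, device_group
    FROM current_day;

    COMPUTE INCREMENTAL STATS events_history;
    TRUNCATE TABLE current_day;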

Issues:

  • Data becomes visible in Impala only 20 minutes after it is generated, once the first process has loaded it into Impala
  • As the day approaches its end, Impala queries become slower (see the diagnostic sketch after this list)
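The slowdown seems consistent with the number of small Parquet files in CURRENT_DAY growing throughout the day. A quick diagnostic sketch (again using the placeholder current_day table):

    -- Inspect how many files, and of what size, each partition has accumulated.
    SHOW FILES IN current_day;
    SHOW TABLE STATS current_day;

    -- Keep statistics reasonably fresh so query plans do not degrade as rows pile up.
    COMPUTE INCREMENTAL STATS current_day;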

I have found many related questions on StackOverflow, but I haven't found a general approach to this problem.

Question: is there a general approach to what seems like a quite common scenario, namely small data from a large number of devices and the need to optimize Impala queries?

Versions:

  • Hadoop = 3.1.4 (TBV)
  • Impala = 3.4.0
  • Kafka = 2.7.0 (Scala 2.13)
  • Kafka Connect plugin for Hadoop = kafka-connect-hdfs3:1.1.1
