
Theoretically speaking, since Node.js is single threaded, how can I achieve parallelism when I define multiple consumers to increase throughput?

For example, if I have a Kafka topic with 4 partitions, how would I be able to consume 4 messages in parallel on the consumer end with Node.js? At most I can achieve concurrency using the single-threaded event loop.

One possible solution would be to fork child processes (in this case 3) so that each process can receive messages from a particular partition, assuming the system has 3 idle cores. But how efficient/effective would this approach be?

What would be the best possible way to achieve this?

Giorgos Myrianthous
laxman
  • While Node.js is single threaded, it still handles I/O in parallel, because I/O does not scale with CPU cores but with I/O channels - in modern hardware these are PCIe lanes. A consumer-grade CPU generally has around 16 PCIe lanes directly controlled by the CPU; a server CPU may have more than 40. Considering a gaming GPU can take up to 16 PCIe lanes, that means on most desktops your server process will be sharing bandwidth with your graphics card, not with other processes. – slebetman May 18 '20 at 09:09
  • But this depends on what your code is doing. Is it calculating big matrix transformations or doing MP3 encoding? Then I/O is not your problem and Node.js is not the optimal platform. Is it doing lots of network or database requests? Then Node.js is a good fit, and the number of threads is not your problem - the number of PCIe lanes is. Choose a CPU with a large number of PCIe lanes, and remember that an NVMe SSD can use at most 4 PCIe lanes, so create a RAID array of NVMe storage to increase the amount of parallel I/O Node can handle. – slebetman May 18 '20 at 09:12
  • @slebetman, thanks for the tons of info - this makes a lot of sense. Also, let's say my consumer performs CPU-intensive tasks per message; would it then be a good idea to fork multiple worker processes? Thanks in advance :) – laxman May 18 '20 at 10:27
  • Yes, in that case the more processes the better. The first way to scale is the cluster module, which requires almost no modification to a normal Node.js app: Node just launches multiple processes and load-balances the socket connections between them. If you want or need fine-grained control of threading, use the worker_threads module instead. Rather than load-balancing socket connections, worker threads accept tasks and asynchronously return the result when completed - more like a traditional multi-threaded design with a message-passing/mailbox/message-queue system. – slebetman May 18 '20 at 11:07
  • However, in my personal experience a typical CRUD-based, database-backed website doesn't gain much from multiple cores with the cluster module. The async engine of Node.js does not have much fat, so it is already about as efficient as you can make it; where I/O dominates, more cores won't speed things up much. My own benchmarks saw under 10% performance gain going from a single core to 4 or even 8 cores. – slebetman May 18 '20 at 11:10
  • To add, the docs say: "Workers (threads) are useful for performing CPU-intensive JavaScript operations. They will not help much with I/O-intensive work. Node.js's built-in asynchronous I/O operations are more efficient than Workers can be. Unlike child_process or cluster, worker_threads can share memory. They do so by transferring ArrayBuffer instances or sharing SharedArrayBuffer instances." – laxman May 18 '20 at 11:23

1 Answer


In Kafka, the partition is the unit of parallelism: broadly, the more partitions a topic has, the higher the throughput one can achieve.

A Kafka topic is divided into a number of partitions, which spreads its data across multiple brokers and allows multiple consumers to read from the topic in parallel. Therefore, in order to achieve parallel processing you need to partition your topic into more than one partition.
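On the Node.js side, a single process can at least interleave partitions concurrently on the event loop. Here is a sketch using the third-party kafkajs client (`npm install kafkajs`); the topic name, group id, and broker address are placeholders, and the broker wiring is guarded by an env var so the sketch can be read and exercised without a running cluster:

```javascript
// Per-message work, kept as a separate function so it is easy to test in isolation.
async function handleMessage({ topic, partition, message }) {
  return `${topic}[${partition}] ${message.value.toString()}`;
}

// Wiring runs only when KAFKA_BROKERS is set, e.g. KAFKA_BROKERS=localhost:9092
if (process.env.KAFKA_BROKERS) {
  const { Kafka } = require('kafkajs');
  const kafka = new Kafka({ clientId: 'demo', brokers: process.env.KAFKA_BROKERS.split(',') });
  const consumer = kafka.consumer({ groupId: 'demo-group' });

  (async () => {
    await consumer.connect();
    await consumer.subscribe({ topic: 'events', fromBeginning: true });
    await consumer.run({
      // Let the single event loop interleave up to 4 partitions' messages.
      // This is concurrency, not CPU parallelism - for the latter, combine
      // with worker_threads or the cluster module as discussed above.
      partitionsConsumedConcurrently: 4,
      eachMessage: async (payload) => console.log(await handleMessage(payload)),
    });
  })();
}

module.exports = { handleMessage };
```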

In order to increase the number of partitions of an existing topic, you can simply run:

bin/kafka-topics.sh \
    --zookeeper localhost:2181 \
    --alter \
    --topic topicName \
    --partitions 40

This won't move existing data, though. Note also that messages with the same key may land in a different partition after the change, since the key-to-partition mapping depends on the partition count. On Kafka 2.2+, kafka-topics.sh accepts --bootstrap-server <broker:port> in place of --zookeeper.


Note on consumers, consumer groups and partitions
If you have N partitions, then you can have up to N consumers within the same consumer group, each reading from a single partition. With fewer consumers than partitions, some consumers read from more than one partition. Conversely, with more consumers than partitions, some consumers are inactive and receive no messages at all.
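The rule above can be illustrated with a toy round-robin model (Kafka's actual assignors - range, round-robin, sticky - are more involved, but the up-to-N bound is the same):

```javascript
// Toy model: spread numPartitions over numConsumers, round-robin.
function assignPartitions(numPartitions, numConsumers) {
  const assignment = Array.from({ length: numConsumers }, () => []);
  for (let p = 0; p < numPartitions; p++) {
    assignment[p % numConsumers].push(p);
  }
  return assignment;
}

console.log(assignPartitions(4, 2)); // [ [ 0, 2 ], [ 1, 3 ] ] - two partitions each
console.log(assignPartitions(4, 6)); // last two consumers get no partitions: idle
```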

Giorgos Myrianthous