
I'm referring to the Confluent Schema Registry:

Is there reliable information on how many distinct schemas a single schema registry can support?

From how I understand it, the schema registry reads the available schemas from a Kafka topic on startup.

So possible limitations could be memory consumption (= number of schemas held in memory at a time) or performance (= lookup of schemas from Kafka).

code-gorilla

2 Answers


Internally, it uses a ConcurrentHashMap to store the registered schemas, so, in theory, the limit is roughly the maximum size of a backing Java array.

Do Java arrays have a maximum size?

However, there are multiple maps, so JVM heap constraints also apply. Larger raw-schema strings use more memory, so there is no exact calculation for this.
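For a rough feel of those heap constraints, here is a minimal back-of-envelope sketch (not the registry's actual code; the schema count, average schema size, and per-entry overhead are made-up numbers):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeapEstimate {
        public static void main(String[] args) {
            // Assumed numbers: 50,000 registered schemas averaging 2 KB of raw JSON each,
            // plus a guessed fixed overhead per entry for map nodes and parsed-schema objects.
            long schemaCount = 50_000;
            long avgRawSchemaBytes = 2 * 1024;
            long perEntryOverheadBytes = 1024;

            long estimatedHeapBytes = schemaCount * (avgRawSchemaBytes + perEntryOverheadBytes);
            System.out.printf("Rough heap for schema storage: ~%d MB%n",
                    estimatedHeapBytes / (1024 * 1024));

            // Schemas are ultimately held in in-memory maps keyed by an int id, much like this,
            // so the theoretical ceiling is the int/array size, but heap runs out long before that.
            Map<Integer, String> schemasById = new ConcurrentHashMap<>();
            schemasById.put(1, "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}");
        }
    }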

OneCricketeer
  • Interesting way of looking at that. So basically the upper bound is 32 - 2 = 30 bit, because of the way schemas are stored in maps and because a 32-bit Integer is used for storing schema ids. For a rough calculation of heap memory, the number of schemas times an estimated average size could be used (plus some unknown factor for other heap memory). – code-gorilla Feb 03 '23 at 20:41
  • Schema texts are MD5-hashed and compared, so that math would be for unique schemas, not necessarily the number of subjects or matching versions between subjects. – OneCricketeer Feb 04 '23 at 02:44
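To illustrate the deduplication point from the comment above, a minimal sketch (not the registry's internal implementation): schema texts can be hashed and compared so that registering the same text twice yields the same id.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class SchemaDedup {
        // Maps the MD5 of a schema text to the id it was first registered under.
        private final Map<String, Integer> idByHash = new HashMap<>();
        private int nextId = 1;

        // Returns the existing id for an already-seen schema text, or assigns a new one.
        public int register(String schemaText) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(schemaText.getBytes(StandardCharsets.UTF_8));
            String hash = new BigInteger(1, digest).toString(16);
            return idByHash.computeIfAbsent(hash, h -> nextId++);
        }

        public static void main(String[] args) throws Exception {
            SchemaDedup registry = new SchemaDedup();
            String schema = "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}";
            System.out.println(registry.register(schema)); // 1
            System.out.println(registry.register(schema)); // still 1: identical text, identical hash
        }
    }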

I created my own benchmark tool for finding out about possible limitations. The link to the GitHub repo is here.

TL;DR:

As suspected by @OneCricketeer, the scaling factor is roughly the number of schemas times the average schema size. I created a tool to see how the registry's memory and CPU usage scale when registering many different AVRO schemas of the same size (using a custom field within the schema to differentiate them). I ran the tool for ~48,000 schemas; for those, ~900 MB of memory were used, with low CPU usage.
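A minimal sketch of such a registration loop (this is not the benchmark tool itself; the registry URL, subject naming, and one-subject-per-schema choice are assumptions), posting generated schemas that differ only in one custom field to the registry's REST API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterManySchemas {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String registryUrl = "http://localhost:8081"; // assumed local registry

            for (int i = 0; i < 1_000; i++) {
                // Each schema has the same size but a different custom field name,
                // so the registry treats every one of them as a distinct schema.
                String avroSchema = String.format(
                    "{\"type\":\"record\",\"name\":\"Bench%d\"," +
                    "\"fields\":[{\"name\":\"custom_%d\",\"type\":\"string\"}]}", i, i);
                // The REST API expects the schema as an escaped string inside a JSON envelope.
                String body = "{\"schema\": \"" + avroSchema.replace("\"", "\\\"") + "\"}";

                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(registryUrl + "/subjects/bench-subject-" + i + "/versions"))
                    .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    System.err.println("Registration failed: " + response.body());
                }
            }
        }
    }

Registering each schema under its own subject side-steps compatibility checks between versions; registering them all as versions of one subject would exercise a different code path.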

Findings:

  • The ramp-up of memory usage is a lot higher in the beginning. After the initial ramp-up, the memory usage increases step-wise as new memory is allocated to hold more schemas.
  • Most of the memory is used for storing the schemas in the ConcurrentHashMap (as expected).
  • The CPU usage does not change significantly with many schemas, and neither does the time to retrieve a schema.
  • There is a cache for holding RawSchema -> ParsedSchema mappings (see SCHEMA_CACHE_SIZE_CONFIG, default 1000), but at least in my tests I could not see a negative impact from a cache miss; retrieving a schema took ~1-2 ms for both hits and misses (see the timing sketch below).
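The timing observation can be reproduced with something like the following (a minimal sketch; the registry URL and schema id 1 are assumptions), which simply times repeated fetches of the same schema id over the REST API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TimeSchemaFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Assumed local registry and an already-registered schema id.
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/schemas/ids/1"))
                .GET()
                .build();

            for (int i = 0; i < 5; i++) {
                long start = System.nanoTime();
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                // The first request may be a cache miss on the registry side,
                // subsequent ones should be hits; compare the timings.
                System.out.printf("fetch %d: status=%d, %d ms%n",
                    i, response.statusCode(), elapsedMs);
            }
        }
    }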

Memory usage (x scale = 100 schemas, y scale = 1 MB):

[memory usage plot]

CPU usage (x scale = 100 schemas, y scale = usage in %):

[CPU usage plot]

Top 10 objects in Java heap:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        718318       49519912  [B (java.base@11.0.17)
   2:        616621       44396712  org.apache.avro.JsonProperties$2
   3:        666225       15989400  java.lang.String (java.base@11.0.17)
   4:        660805       15859320  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@11.0.17)
   5:        616778       14802672  java.util.concurrent.ConcurrentLinkedQueue (java.base@11.0.17)
   6:        264000       12672000  org.apache.avro.Schema$Field
   7:          6680       12568952  [I (java.base@11.0.17)
   8:        368958       11806656  java.util.HashMap$Node (java.base@11.0.17)
   9:         88345        7737648  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.17)
  10:        197697        6326304  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.17)
code-gorilla