
I'm referring to the Confluent Schema Registry:

Is there reliable information on how many distinct schemas a single schema registry can support?

From how I understand it, the schema registry reads the available schemas from a Kafka topic on startup.

So possible limitations could be memory consumption (= number of schemas held in memory at a time) or performance (= lookup of schemas from Kafka).

code-gorilla

2 Answers


Internally, it uses a ConcurrentHashMap to store the registered schemas, so, in theory, the limit is roughly the maximum size of a backing Java array.

Do Java arrays have a maximum size?

However, there are multiple maps, so JVM heap constraints also apply. Larger raw-schema strings use more memory, so there is no exact calculation for this.
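For a rough feel of those heap constraints, here is a minimal back-of-envelope sketch (not the registry's actual code; the schema count, average schema size, and per-entry overhead are made-up numbers):

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    public class HeapEstimate {
        public static void main(String[] args) {
            // Assumed numbers: 50,000 registered schemas averaging 2 KB of raw JSON each,
            // plus a guessed fixed overhead per entry for map nodes and parsed-schema objects.
            long schemaCount = 50_000;
            long avgRawSchemaBytes = 2 * 1024;
            long perEntryOverheadBytes = 1024;

            long estimatedHeapBytes = schemaCount * (avgRawSchemaBytes + perEntryOverheadBytes);
            System.out.printf("Rough heap for schema storage: ~%d MB%n",
                    estimatedHeapBytes / (1024 * 1024));

            // Schemas are ultimately held in in-memory maps keyed by an int id, much like this,
            // so the theoretical ceiling is the int/array size, but heap runs out long before that.
            Map<Integer, String> schemasById = new ConcurrentHashMap<>();
            schemasById.put(1, "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}");
        }
    }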

OneCricketeer
  • Interesting way of looking at that. So basically the upper bound is 32 - 2 = 30 bit, because of the way schemas are stored in maps and because a 32-bit Integer is used for storing schema ids. For a rough calculation of heap memory, the number of schemas times an estimated average size could be used (plus some unknown factor for other heap memory). – code-gorilla Feb 03 '23 at 20:41
  • Schema texts are MD5-hashed and compared, so that math would be for unique schemas, not necessarily the number of subjects or matching versions between subjects. – OneCricketeer Feb 04 '23 at 02:44
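To illustrate the deduplication point from the comment above, a minimal sketch (not the registry's internal implementation): schema texts can be hashed and compared so that registering the same text twice yields the same id.

    import java.math.BigInteger;
    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.util.HashMap;
    import java.util.Map;

    public class SchemaDedup {
        // Maps the MD5 of a schema text to the id it was first registered under.
        private final Map<String, Integer> idByHash = new HashMap<>();
        private int nextId = 1;

        // Returns the existing id for an already-seen schema text, or assigns a new one.
        public int register(String schemaText) throws Exception {
            byte[] digest = MessageDigest.getInstance("MD5")
                    .digest(schemaText.getBytes(StandardCharsets.UTF_8));
            String hash = new BigInteger(1, digest).toString(16);
            return idByHash.computeIfAbsent(hash, h -> nextId++);
        }

        public static void main(String[] args) throws Exception {
            SchemaDedup registry = new SchemaDedup();
            String schema = "{\"type\":\"record\",\"name\":\"Example\",\"fields\":[]}";
            System.out.println(registry.register(schema)); // 1
            System.out.println(registry.register(schema)); // still 1: identical text, identical hash
        }
    }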

I created my own benchmark tool for finding out about possible limitations. The link to the GitHub repo is here.

TL;DR:

As suspected by @OneCricketeer, the scaling factor is roughly the number of schemas times the average schema size. I created a tool to see how the registry's memory and CPU usage scale when registering many different AVRO schemas of the same size (using a custom field within the schema to differentiate them). I ran the tool for ~48,000 schemas; for those, ~900 MB of memory were used, with low CPU usage.
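A minimal sketch of such a registration loop (this is not the benchmark tool itself; the registry URL, subject naming, and one-subject-per-schema choice are assumptions), posting generated schemas that differ only in one custom field to the registry's REST API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class RegisterManySchemas {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            String registryUrl = "http://localhost:8081"; // assumed local registry

            for (int i = 0; i < 1_000; i++) {
                // Each schema has the same size but a different custom field name,
                // so the registry treats every one of them as a distinct schema.
                String avroSchema = String.format(
                    "{\"type\":\"record\",\"name\":\"Bench%d\"," +
                    "\"fields\":[{\"name\":\"custom_%d\",\"type\":\"string\"}]}", i, i);
                // The REST API expects the schema as an escaped string inside a JSON envelope.
                String body = "{\"schema\": \"" + avroSchema.replace("\"", "\\\"") + "\"}";

                HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(registryUrl + "/subjects/bench-subject-" + i + "/versions"))
                    .header("Content-Type", "application/vnd.schemaregistry.v1+json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                if (response.statusCode() != 200) {
                    System.err.println("Registration failed: " + response.body());
                }
            }
        }
    }

Registering each schema under its own subject side-steps compatibility checks between versions; registering them all as versions of one subject would exercise a different code path.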

Findings:

  • The ramp-up of memory usage is a lot higher in the beginning. After the initial ramp-up, the memory usage increases step-wise as new memory is allocated to hold more schemas.
  • Most of the memory is used for storing the schemas in the ConcurrentHashMap (as expected).
  • The CPU usage does not change significantly with many schemas, and neither does the time to retrieve a schema.
  • There is a cache for holding RawSchema -> ParsedSchema mappings (see SCHEMA_CACHE_SIZE_CONFIG, default 1000), but at least in my tests I could not see a negative impact from a cache miss; retrieving a schema took ~1-2 ms for both hits and misses (see the timing sketch below).
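The timing observation can be reproduced with something like the following (a minimal sketch; the registry URL and schema id 1 are assumptions), which simply times repeated fetches of the same schema id over the REST API:

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class TimeSchemaFetch {
        public static void main(String[] args) throws Exception {
            HttpClient client = HttpClient.newHttpClient();
            // Assumed local registry and an already-registered schema id.
            HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://localhost:8081/schemas/ids/1"))
                .GET()
                .build();

            for (int i = 0; i < 5; i++) {
                long start = System.nanoTime();
                HttpResponse<String> response =
                    client.send(request, HttpResponse.BodyHandlers.ofString());
                long elapsedMs = (System.nanoTime() - start) / 1_000_000;
                // The first request may be a cache miss on the registry side,
                // subsequent ones should be hits; compare the timings.
                System.out.printf("fetch %d: status=%d, %d ms%n",
                    i, response.statusCode(), elapsedMs);
            }
        }
    }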

Memory usage (x scale = 100 schemas, y scale = 1 MB):

[memory usage plot]

CPU usage (x scale = 100 schemas, y scale = usage in %):

[CPU usage plot]

Top 10 objects in Java heap:

 num     #instances         #bytes  class name (module)
-------------------------------------------------------
   1:        718318       49519912  [B (java.base@11.0.17)
   2:        616621       44396712  org.apache.avro.JsonProperties$2
   3:        666225       15989400  java.lang.String (java.base@11.0.17)
   4:        660805       15859320  java.util.concurrent.ConcurrentLinkedQueue$Node (java.base@11.0.17)
   5:        616778       14802672  java.util.concurrent.ConcurrentLinkedQueue (java.base@11.0.17)
   6:        264000       12672000  org.apache.avro.Schema$Field
   7:          6680       12568952  [I (java.base@11.0.17)
   8:        368958       11806656  java.util.HashMap$Node (java.base@11.0.17)
   9:         88345        7737648  [Ljava.util.concurrent.ConcurrentHashMap$Node; (java.base@11.0.17)
  10:        197697        6326304  java.util.concurrent.ConcurrentHashMap$Node (java.base@11.0.17)
code-gorilla