
We have a server that must be updatable without downtime.

We achieve this by making the application only a thin loader layer; all the logic lives in a shared object that the application dlopen()s. Let's call this library libmyservice.so.1.23.

When a request comes in, the server creates a thread and calls the appropriate APIs from the library to serve it.

When the server needs to be hot-updated, it downloads a new set of libraries and loads them with dlopen; let's call the new library libmyservice.so.1.24.

During the update there is an intermediate period where already-running requests are still served by the old library while new requests are served by the new one. When the old requests finish, the old library is unloaded and all requests use the new library. So there is no downtime during the update.
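For concreteness, the swap itself is roughly the following (a simplified sketch, not our real interface; handle_request, g_active and load_new_version are made-up names):

```cpp
#include <dlfcn.h>
#include <memory>
#include <stdexcept>
#include <string>

using handler_fn = void (*)(int request_fd);

struct LibHandle {
    void*      dl      = nullptr;
    handler_fn handler = nullptr;
    ~LibHandle() { if (dl) dlclose(dl); }
};

std::shared_ptr<LibHandle> g_active;   // currently active library version

void load_new_version(const std::string& path) {
    auto lib = std::make_shared<LibHandle>();
    lib->dl = dlopen(path.c_str(), RTLD_NOW | RTLD_LOCAL);
    if (!lib->dl) throw std::runtime_error(dlerror());
    lib->handler = reinterpret_cast<handler_fn>(dlsym(lib->dl, "handle_request"));
    if (!lib->handler) throw std::runtime_error(dlerror());
    std::atomic_store(&g_active, lib); // new requests pick up the new version
}

void serve(int request_fd) {
    auto lib = std::atomic_load(&g_active);  // pin this version for the request
    lib->handler(request_fd);
}   // the old .so is dlclose()d only after its last in-flight request finishes
```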

The library is compiled to be as self-contained as possible. It depends on Boost, OpenSSL and many other C++ libraries we don't have control over. All these dependencies are shipped along with the library, and an rpath is used to load them from the same directory as the library.

In practice we ran into two problems:

  • Symbol conflicts: when the new library is loaded, the dynamic linker reuses symbols from the old one, which can cause weird bugs. We found this can be worked around by passing RTLD_DEEPBIND to dlopen. But there is also dlmopen, which does something similar (see the sketch after this list). Which of the two should be used when the goal is to completely isolate the old and new sets of libraries?

  • Static initialization and cleanup: we had a bug where using global objects such as std::cerr from the dlopened library crashed when the library was loaded with RTLD_DEEPBIND. We still don't fully understand why this happens. We believe that static initialization doesn't run for the new copy of libstdc++'s symbols when RTLD_DEEPBIND is used. This looks like a bug in the dynamic linker to me. How are static initialization and cleanup supposed to work in shared objects?
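
To make the comparison concrete, these are the two loading variants in question (a minimal sketch; flags and paths are illustrative, and dlmopen/RTLD_DEEPBIND are glibc-specific):

```cpp
#define _GNU_SOURCE  // for RTLD_DEEPBIND, dlmopen, LM_ID_NEWLM (glibc)
#include <dlfcn.h>

// Variant 1: same namespace, but the new library prefers its own symbols
// over ones that are already loaded.
void* load_deepbind(const char* path) {
    return dlopen(path, RTLD_NOW | RTLD_LOCAL | RTLD_DEEPBIND);
}

// Variant 2: a completely fresh link-map namespace; nothing is shared with
// previously loaded libraries.
void* load_new_namespace(const char* path) {
    return dlmopen(LM_ID_NEWLM, path, RTLD_NOW);
}
```

(One practical difference we are aware of: glibc allows only a small, fixed number of extra link-map namespaces, so repeated dlmopen-based swaps may eventually hit that limit.)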

How can I load two shared libraries and all their dependencies simultaneously without any conflict between the two? Is it even possible? That is: no symbol conflicts and no conflicts in static initialization.

The software already works properly on Windows, because DLLs don't seem to conflict with each other the way shared objects do.

EDIT:

Although I like the idea of multiple processes, the architecture is already decided and I can't change it without many approvals (which I'm unlikely to get).

To make matters more complicated (which I didn't want in the original question), the actual architecture is like this: server application -> main library -> hot-updatable core libraries. The "main library" is the product; it is a proxy above the hot-swappable core libraries. The customer gets this library and integrates it into their product. During an update, the customer's product downloads the update package and calls a function in the main library to trigger the hot swap. The main library doesn't create the threads, but it is implemented to be thread-safe.

So multiple processes are not really an option.

Calmarius
  • Not exactly a solution to your problem, but a cleaner way could be to create a new process and load the new SO there. Then pass all new requests to the new process, and once all old requests are done, the old process can hand over the connection etc. to the new process before exiting. – Ashutosh Raghuwanshi Jan 07 '21 at 12:40

1 Answer


I'm not going to answer the question, but this advice is too long to fit in a comment... TL;DR: beware the XY problem :)

I don't know whether dynamic library reloading can be done, but as you noticed, it is a source of trouble even if you can get it to work. So I would consider a different and more customary approach.

You could create a "super-server" executable whose task is only to listen for incoming requests and hand them off to the actual server.

When a new connection comes in, the super-server forks a child process (which inherits the connection's file descriptor) and execs the request-handler executable. In the simplest case, you don't even need to write this; xinetd does exactly that. Systemd can do it too, using socket activation.
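
A bare-bones version of that accept/fork/exec loop could look like the following (a sketch only: the port and the ./request-handler path are placeholders, and error handling is omitted):

```cpp
#include <arpa/inet.h>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>
#include <csignal>

int main() {
    signal(SIGCHLD, SIG_IGN);                 // auto-reap finished handlers

    int listener = socket(AF_INET, SOCK_STREAM, 0);
    sockaddr_in addr{};
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(8080);              // placeholder port
    bind(listener, reinterpret_cast<sockaddr*>(&addr), sizeof addr);
    listen(listener, SOMAXCONN);

    for (;;) {
        int conn = accept(listener, nullptr, nullptr);
        if (conn < 0) continue;
        if (fork() == 0) {                    // child: becomes the handler
            dup2(conn, STDIN_FILENO);         // xinetd-style: fds 0/1 are the socket
            dup2(conn, STDOUT_FILENO);
            close(listener);
            close(conn);
            execl("./request-handler", "request-handler", (char*)nullptr);
            _exit(127);                       // exec failed
        }
        close(conn);                          // parent keeps only the listener
    }
}
```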

If there is too much overhead from having one process per request, you could also code your super-server to hand off the connection's file descriptor to an existing long-running server process. The actual server can then handle the request however it wants, for example using threads, or a select()-based reactor. The super-server can shut down the server by sending it a SIGTERM signal; the server should be coded to handle this by finishing all in-flight requests and then shutting down.
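
Handing an accepted connection to an already-running server process is usually done by sending the file descriptor over a Unix-domain socket with SCM_RIGHTS; roughly like this (send_fd and unix_sock are illustrative names):

```cpp
#include <sys/socket.h>
#include <sys/uio.h>
#include <cstring>

// Send conn_fd to the process on the other end of unix_sock.
bool send_fd(int unix_sock, int conn_fd) {
    char dummy = 'x';                      // at least one byte of real data
    iovec iov{&dummy, 1};

    alignas(cmsghdr) char ctrl[CMSG_SPACE(sizeof(int))] = {};
    msghdr msg{};
    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = ctrl;
    msg.msg_controllen = sizeof ctrl;

    cmsghdr* cm = CMSG_FIRSTHDR(&msg);
    cm->cmsg_level = SOL_SOCKET;
    cm->cmsg_type  = SCM_RIGHTS;           // transfer the descriptor itself
    cm->cmsg_len   = CMSG_LEN(sizeof(int));
    std::memcpy(CMSG_DATA(cm), &conn_fd, sizeof(int));

    return sendmsg(unix_sock, &msg, 0) == 1;
}
```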

Finally, if you really can't tolerate any downtime, none of this trickery will save you if your server is running on only one machine and the machine goes down (or needs to be rebooted for a kernel update). You'll need redundancy in the form of multiple servers. At that point you might as well just upgrade these servers by taking them down one by one, and none of this hot-swapping is actually needed.

Thomas
  • Still I think the notion of a "companion" process that is safely restartable would be easier to implement correctly than hot-swapping code at runtime. – Thomas Jan 08 '21 at 11:22