
I have a process which creates a new child process for handling certain tasks. There's also one child thread created by the main process which is used for one specific task. These processes and thread communicate through UNIX domain sockets.

For some reason, the child process got stuck and remains in the umtx state. Consequently, whenever the main process tries to send data to the child thread, the thread does not respond, because it is still in a wait state, waiting for the child process to exit.

Ultimately, the TX queue of the parent's socket keeps accumulating messages until it is full. At that point send() fails with an ENOBUFS error.

How can I overcome this issue?

Mecki
pseudonym
  • @hellow: Thanks for the instant response. Is it correct to add the following snippet before the `send()`? `if (fflush((FILE *)&communication_sockets[0])) { printf("fflush failed"); }` – pseudonym Aug 27 '18 at 08:35
  • I'm not convinced that this is defined, or possible, for datagram sockets, at least from the sending end. If I wanted to 'flush' a protocol based on datagrams, then I would probably add a 'flush' exchange to the protocol, where the sender issues a 'flush comms' message to the peer. Upon receipt, the peer reads and discards all pending datagrams, waits a long time, reads and discards all pending datagrams again and then sends a 'flush complete' reply to the originator. – Martin James Aug 27 '18 at 09:27
  • @pseudonym: You are using an Unix domain datagram socket, which preserves message boundaries. This means the data written (`send()`) cannot mix with previously written data, nor with any data written in the future; the kernel takes care of ensuring that. However, the write side cannot control what the read side already has; what is sent, is sent, and it is outside the control of the sender. (If it was otherwise, malicious programs would try to "un-send" reports of their behaviour...) Therefore, what you seem to be trying to do, makes no sense at all to me. – Nominal Animal Aug 27 '18 at 10:39
  • AFAIK, with blocking datagram AF_UNIX sockets, send() will block until the datagram is stored in the (destination) queue. If there is space in the queue, it won't block. So there is no need to flush an output queue, because there isn't any output queue. – SKi Aug 27 '18 at 12:34
  • Hello everyone, I have actually changed the question to describe my exact problem scenario. Please help me with it. – pseudonym Aug 28 '18 at 07:16
  • @pseudonym *Due to some reason, the child process got stuck and it remains in umtx state.* Fix **that**. With no process able to read from the Unix socket, where would data be flushed to when its buffer is full? – Andrew Henle Aug 28 '18 at 07:34
  • @AndrewHenle Thanks for the suggestion. But that's actually a rare system failure which is very unlikely to happen. Is there any other way to solve this problem ? – pseudonym Aug 28 '18 at 07:40

1 Answer


How can I overcome this issue?

Find out why your child process gets stuck and prevent this from happening, then the socket queue won't run full. This is the cause of your entire problem, isn't it?

How else would you like to overcome it? Giving the IPC socket a huge buffer will only make it accumulate more data but in the end it will fail the same way. So this would not solve anything, it would just delay the problem.

Also handle the error in a meaningful way: for example, try sending the message again after a short while, and if it fails again, kill the child process. This way your main process can at least recover when your child process fails.
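The retry-then-kill idea could look roughly like the sketch below. It is only an illustration under assumptions: `send_with_retry` and `send_or_kill_child` are hypothetical helper names, the attempt count and delay are arbitrary, and it assumes a datagram socket where a full queue is reported as ENOBUFS (or as EAGAIN/EWOULDBLOCK when sending with MSG_DONTWAIT). It only sends SIGKILL and does not reap the child, since in the question a thread is already waiting for the child to exit.

```c
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stddef.h>
#include <sys/socket.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: send one datagram without blocking and retry a few
 * times if the queue is full. Returns 0 on success, -1 on failure; on
 * failure errno is the one set by the last send() call. */
int send_with_retry(int fd, const void *buf, size_t len,
                    int max_attempts, long delay_ms)
{
    struct timespec delay = { delay_ms / 1000, (delay_ms % 1000) * 1000000L };

    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        if (send(fd, buf, len, MSG_DONTWAIT) >= 0)
            return 0;                       /* datagram was queued */
        if (errno != ENOBUFS && errno != EAGAIN && errno != EWOULDBLOCK)
            return -1;                      /* unrelated error: do not retry */
        if (attempt + 1 < max_attempts)
            nanosleep(&delay, NULL);        /* give the reader time to drain */
    }
    return -1;                              /* queue stayed full */
}

/* Hypothetical helper: if the queue stays full after the retries, assume
 * the child is stuck and kill it so the rest of the program can recover.
 * Retry count (3) and delay (100 ms) are arbitrary example values. */
int send_or_kill_child(int fd, const void *buf, size_t len, pid_t child)
{
    if (send_with_retry(fd, buf, len, 3, 100) == 0)
        return 0;
    if (errno == ENOBUFS || errno == EAGAIN || errno == EWOULDBLOCK)
        kill(child, SIGKILL);               /* reaping is left to the waiter */
    return -1;
}
```

The main process would then call something like `send_or_kill_child(communication_sockets[0], msg, msg_len, child_pid)` in place of a bare `send()`, where `msg`, `msg_len` and `child_pid` stand for whatever your program already has. Whether killing the child is acceptable depends on what it was doing when it got stuck.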

And last but not least: you should not use UNIX sockets for inter-thread communication. They are intended for inter-process communication. Using them between threads can have undesired effects and is strongly discouraged, unless you do it for testing purposes.

Mecki