
I have a function where certain data is being processed, and if the data meets a certain criterion, it should be handled separately while the rest of the data continues to be processed.

As an arbitrary example: I'm scraping a web page and collecting all the attributes of an element. If one of the elements is a form that just so happens to be hidden, I want to handle it separately, while the rest of the elements continue being processed:

import os

def get_hidden_forms(element_att):
    if element_att == 'hidden':
        os.fork()
        # handle this separately
    else:
        pass  # continue handling any elements that are not hidden
    # join both processes

Can this be done with os.fork() or is it intended for another purpose?

I know that os.fork() copies everything about the process, but I could just change values before forking, as stated in this post.

  • Anything against using the `multiprocessing` module? Why go all the way down to `os.fork()`? – yorodm Jan 21 '19 at 15:14
  • @yorodm No, I have nothing against the multiprocessing module, I'm not sure what led you to think that, but after reading the docs on the module I just thought that os.fork() would probably suit my needs a little better. –  Jan 21 '19 at 15:25
  • That's exactly what I meant by "anything against it" (a.k.a doesn't work for you) – yorodm Jan 21 '19 at 16:10
  • @aeaglez I'm with yorodm on this one; `os.fork` is really low-level by python standards, and it usually exists to fill a specific niche. `multiprocessing` offers a saner API built on top of the `fork` API. – 0xdd Jan 21 '19 at 16:12

1 Answer


fork basically creates a clone of the calling process, with its own address space and a new PID.

From that point on, both processes continue running from the next instruction after the fork() call. You normally inspect its return value and decide what the appropriate action is. If it returns an int greater than 0, that is the PID of the child process and you know you are in the parent... you continue with the parent's work. If it is equal to 0, you are in the child process and should do the child's work. A value less than 0 means fork has failed; Python handles that and raises OSError, which you should handle (you're still in, and there only is, the parent).
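
A minimal sketch of that flow (handle_child_work and handle_parent_work are just placeholders for whatever you want each side to do):

import os

def handle_child_work():
    ...  # the work you want handled separately

def handle_parent_work():
    ...  # the regular processing

try:
    pid = os.fork()
except OSError:
    # fork failed; there is no child, you are still (only) in the parent
    raise

if pid == 0:
    # return value 0: we are in the child
    handle_child_work()
    os._exit(0)  # leave the child without running the parent's cleanup
else:
    # return value > 0: we are in the parent, pid is the child's PID
    handle_parent_work()
    os.waitpid(pid, 0)  # reap the child (more on that below)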

Now the absolute minimum you'd need to take care of after having forked a child process is to make sure you wait() for it and reap its return code properly, otherwise you will (at least temporarily) create zombies. In practice that means you may want to install a SIGCHLD handler to reap your process's children as they finish their execution.
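
For illustration, such a handler could look roughly like this (a sketch; it simply discards the exit statuses, you may want to log or check them):

import os
import signal

def reap_children(signum, frame):
    # collect any finished children without blocking, so they don't linger as zombies
    while True:
        try:
            pid, status = os.waitpid(-1, os.WNOHANG)
        except ChildProcessError:
            break  # no children left to wait for
        if pid == 0:
            break  # children still running, none have exited yet
        # status can be inspected with os.WIFEXITED()/os.WEXITSTATUS() if needed

signal.signal(signal.SIGCHLD, reap_children)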

In theory you could use it the way you've described, but it may be a bit too "low level" (and uncomfortable) for that. It would probably be easier to write and to read/understand if you had dedicated code for what you want to handle separately and used multiprocessing to run this extra work in separate processes.
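
For example, something along these lines (handle_hidden_form and the way elements are iterated are made up for the sake of the example):

from multiprocessing import Process

def handle_hidden_form(element_att):
    ...  # dedicated code for the hidden form

def process_elements(element_atts):
    workers = []
    for element_att in element_atts:
        if element_att == 'hidden':
            p = Process(target=handle_hidden_form, args=(element_att,))
            p.start()  # run the separate work in its own process
            workers.append(p)
        else:
            ...  # continue handling elements that are not hidden
    for p in workers:
        p.join()  # wait for the separate work to finish before moving on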

Ondrej K.
  • There is dedicated code for it; the only thing I'm contemplating is how they would join back together in one process again. For multiprocessing I could just spawn a new process, keep my current one running, and before the critical section is over, join them? Would that suffice? – Jan 21 '19 at 15:23
  • Short version is yes. [`.join()`](https://docs.python.org/3/library/multiprocessing.html#multiprocessing.Process.join) would wait for the other process to finish (if it has not done so in the meanwhile). Unless you really need the low-level control, I'd prefer comfort of greater abstraction which should allow for simpler code. – Ondrej K. Jan 21 '19 at 19:57