Using the multiprocessing module to run parallel processes where one is fed by (dependent on) the other, for the Viterbi algorithm

Question

I have recently played around with Python's multiprocessing module to speed up the forward-backward algorithm for Hidden Markov Models as forward filtering and backward filtering can run independently. Seeing the run-time halve was awe-inspiring stuff.

I am now trying to add multiprocessing to my iterative Viterbi algorithm. In this algorithm, the two processes I am trying to run are not independent. The val_max part can run independently, but arg_max[t] depends on val_max[t-1]. So I played with the idea of running val_max as one process and arg_max as a second process that is fed by val_max.

I admit to being a bit out of my depth here and do not know much about multiprocessing beyond watching some basic videos and browsing blogs. I provide my attempt below, but it does not work.


import numpy as np
from time import time,sleep
import multiprocessing as mp

class Viterbi:


    def __init__(self,A,B,pi):
        self.M = A.shape[0] # number of hidden states
        self.A = A  # Transition Matrix
        self.B = B   # Observation Matrix
        self.pi = pi   # Initial distribution
        self.T = None   # time horizon
        self.val_max = None
        self.arg_max = None
        self.obs = None
        self.sleep_time = 1e-6
        self.output = mp.Queue()


    def get_path(self,x):
        # returns the most likely state sequence given observed sequence x
        # using the Viterbi algorithm
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        self.val_max[0] = self.pi*self.B[:,x[0]]
        for t in range(1, self.T):
            # Independent Process
            self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,x[t]]) , axis = 0  ) 
            # Dependent Process
            self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)

        # BACKTRACK
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(self.val_max[self.T-1])
        for t in range(self.T-2, -1, -1):
            states[t] = self.arg_max[t+1, states[t+1]]
        return states

    def get_val(self):
        '''Independent Process'''
        for t in range(1,self.T):
            self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,self.obs[t]]) , axis = 0  ) 
        self.output.put(self.val_max)

    def get_arg(self):
        '''Dependent Process'''
        for t in range(1,self.T):
            while 1:
                # Process info if available
                if self.val_max[t-1].any() != 0:
                    self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)
                    break
                # Else sleep and wait for info to arrive
                sleep(self.sleep_time)
        self.output.put(self.arg_max)

    def get_path_parallel(self,x):
        self.obs = x
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        val_process = mp.Process(target=self.get_val)
        arg_process = mp.Process(target=self.get_arg)  
        # get first initial value for val_max which can feed arg_process
        self.val_max[0] = self.pi*self.B[:,x[0]]
        arg_process.start()
        val_process.start()
        arg_process.join()
        val_process.join()

Note: get_path_parallel does not have backtracking yet.

It would seem that val_process and arg_process never really run, and I am not sure why. You can run the code on the Wikipedia example for the Viterbi algorithm.

obs = np.array([0,1,2])  # normal then cold and finally dizzy  

pi = np.array([0.6,0.4])

A = np.array([[0.7,0.3],
             [0.4,0.6]])

B = np.array([[0.5,0.4,0.1],
             [0.1,0.3,0.6]]) 

viterbi = Viterbi(A,B,pi)
path = viterbi.get_path(obs)

I also tried using Ray, but I had no clue what I was really doing there. Can you please recommend what I should do to get the parallel version to run? I must be doing something very wrong, but I do not know what.

Your help would be much appreciated.

Welcome to SO. Consider taking a look at [producer-consumer pattern](https://stonesoupprogramming.com/2017/09/11/python-multiprocessing-producer-consumer-pattern/) that is heavily used in multiprocessing. — Sıddık Açıl, Jun 23 '19 at 15:29
@SıddıkAçıl thank you so very much. I was not aware of such a well defined pattern. This is very helpful indeed. I will give this a read and attempt to fix my code. — Dylan Solms, Jun 23 '19 at 15:33
@SıddıkAçıl I have managed to get the code working thanks to your great link. I also appreciate your answer very much, as it gave me great insight to learn from. My code is much slower than the serial version, most likely due to the fact that the processes I am attempting to separate are already so fast and small that the overhead of concurrency is not worth it. I have included the working code as an answer to this question along with some of my thoughts. — Dylan Solms, Jun 24 '19 at 14:12

score 1 · Answer 1 · answered Jun 24 '19 at 14:06

I have managed to get my code working thanks to @SıddıkAçıl. The producer-consumer pattern does the trick. I also realised that the processes do run successfully, but if one does not store the final results in a "result queue" of sorts, they vanish. By this I mean that the processes filled in values in my numpy arrays val_max and arg_max after start(), but when I read the arrays in the main process they were still np.zeros arrays. I verified that the arrays were filled correctly by printing them just before each process terminated (at the last iteration, t = self.T-1). So instead of printing them, I put them on a multiprocessing Queue object on the final iteration to capture the entire filled-in array.

I provide my updated working code below. NOTE: it works but takes twice as long to complete as the serial version. My thoughts on why this might be so are as follows:
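The vanishing results can be reproduced in isolation. The following is a minimal sketch, not part of the original code: `Table`, `fill` and `run_demo` are hypothetical names, and the `start_method` parameter is only there so the start method can be chosen explicitly (`None` picks the platform default). The child fills its own copy of the array, so the parent only sees the values that come back through the queue.

```python
import multiprocessing as mp

import numpy as np


class Table:
    """Hypothetical stand-in for the Viterbi object: it owns one numpy array."""

    def __init__(self, n):
        self.values = np.zeros(n)

    def fill(self, result_queue):
        # Runs in the child process and writes to the child's copy of self.values.
        self.values[:] = 1.0
        # Send the filled array back explicitly so the parent can receive it.
        result_queue.put(self.values)


def run_demo(start_method=None):
    """Return (sum of the parent's array, sum of the array received via the queue)."""
    ctx = mp.get_context(start_method)  # None selects the platform default
    table = Table(4)
    result_queue = ctx.Queue()
    child = ctx.Process(target=table.fill, args=(result_queue,))
    child.start()
    received = result_queue.get(timeout=30)  # fetch the result before joining
    child.join()
    return float(table.values.sum()), float(received.sum())


if __name__ == "__main__":
    parent_sum, received_sum = run_demo()
    print("parent copy:", parent_sum)    # the parent's array is still all zeros
    print("from queue:", received_sum)   # the array filled in by the child
```

The result is fetched before `join()` because a process that has put items on a queue waits for them to be flushed before it exits.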

1. I can get it to run as two processes but don't actually know how to do it properly. Experienced programmers might know how to fix it with the chunksize parameter.
2. The two processes I am separating are numpy matrix operations. These operations already execute so fast that the overhead of concurrency (multiprocessing) is not worth the theoretical improvement. Had the two processes been the two original for loops (as used in Wikipedia and most implementations), multiprocessing might have given gains (perhaps I should investigate this). Furthermore, because this is a producer-consumer pattern and not two independent processes (a producer-producer pattern), the best we can expect is a run time equal to that of the longer of the two processes (in this case the producer takes twice as long as the consumer). We cannot expect the run time to halve as in the producer-producer scenario (which is what happened with my parallel forward-backward HMM filtering algorithm).
3. My computer has 4 cores and numpy already does built-in CPU multiprocessing optimization on its operations. By using cores myself to try to make the code faster, I may be depriving numpy of cores that it could use more effectively. To check this, I am going to time the numpy operations and see whether they are slower in my concurrent version than in my serial version.
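One way to put numbers on the second point is to time a single val_max update against the serialisation a multiprocessing Queue performs on every item. This is a rough sketch under stated assumptions: the helper names are hypothetical, the matrices are the Wikipedia example values, and the pickle round trip covers only part of the queue cost (it leaves out the pipe and the feeder thread), so it gives a lower bound rather than a full measurement.

```python
import pickle
from timeit import timeit

import numpy as np

A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
val_max = np.array([0.3, 0.04])  # val_max[0] for the Wikipedia example


def viterbi_step():
    # One val_max update, the same expression used in get_val.
    return np.max(A * np.outer(val_max, B[:, 1]), axis=0)


def pickle_round_trip():
    # A multiprocessing Queue pickles each item on put and unpickles it on get.
    return pickle.loads(pickle.dumps(val_max))


def per_call_seconds(func, number=10000):
    # Average wall-clock time of one call, in seconds.
    return timeit(func, number=number) / number


if __name__ == "__main__":
    print("numpy step       :", per_call_seconds(viterbi_step))
    print("pickle round trip:", per_call_seconds(pickle_round_trip))
```

If the two figures come out in the same range on your machine, every item sent through the queue costs about as much as the work it carries, which would be consistent with the slowdown described above.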

I will update if I learn anything new. If you perhaps know the real reason for why my concurrent code is so much slower, please do let me know. Here is the code:


import numpy as np
from time import time
import multiprocessing as mp

class Viterbi:


    def __init__(self,A,B,pi):
        self.M = A.shape[0] # number of hidden states
        self.A = A  # Transition Matrix
        self.B = B   # Observation Matrix
        self.pi = pi   # Initial distribution
        self.T = None   # time horizon
        self.val_max = None
        self.arg_max = None
        self.obs = None
        self.intermediate = mp.Queue()
        self.result = mp.Queue()



    def get_path(self,x):
        '''Sequential/Serial Viterbi Algorithm with backtracking'''
        self.T = len(x)
        self.val_max = np.zeros((self.T, self.M))
        self.arg_max = np.zeros((self.T, self.M))
        self.val_max[0] = self.pi*self.B[:,x[0]]
        for t in range(1, self.T):
            # Independent Process
            self.val_max[t] = np.max( self.A*np.outer(self.val_max[t-1],self.B[:,x[t]]) , axis = 0  ) 
            # Dependent Process
            self.arg_max[t] = np.argmax( self.val_max[t-1]*self.A.T, axis = 1)

        # BACKTRACK
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(self.val_max[self.T-1])
        for t in range(self.T-2, -1, -1):
            states[t] = self.arg_max[t+1, states[t+1]]
        return states

    def get_val(self,initial_val_max):
        '''Independent Producer Process'''
        val_max = initial_val_max
        for t in range(1,self.T):
            val_max = np.max( self.A*np.outer(val_max,self.B[:,self.obs[t]]) , axis = 0  )
            self.intermediate.put(val_max)
            if t == self.T-1:
                # we only need the last val_max value for backtracking; it is tagged
                # because the order in which the two processes finish is not guaranteed
                self.result.put(('val_max', val_max))

    def get_arg(self):
        '''Dependent Consumer Process.'''
        t = 1
        while t < self.T:
            val_max = self.intermediate.get()   # blocks until val_max[t-1] is available
            self.arg_max[t] = np.argmax( val_max*self.A.T, axis = 1)
            if t == self.T-1:
                self.result.put(('arg_max', self.arg_max))
            t += 1

    def get_path_parallel(self,x):
        '''Multiprocessing producer-consumer implementation of Viterbi algorithm.'''
        self.obs = x
        self.T = len(x)
        self.arg_max = np.zeros((self.T, self.M))  # we don't tabulate val_max anymore
        initial_val_max = self.pi*self.B[:,x[0]]
        producer_process = mp.Process(target=self.get_val,args=(initial_val_max,),daemon=True)
        consumer_process = mp.Process(target=self.get_arg,daemon=True) 
        self.intermediate.put(initial_val_max)  # initial production put into pipeline for consumption
        consumer_process.start()  # we can already consume initial_val_max
        producer_process.start()
        # collect both tagged results, whichever process delivers first
        results = dict(self.result.get() for _ in range(2))
        return self.backtrack(results['val_max'],results['arg_max']) # backtrack takes last row of val_max and entire arg_max

    def backtrack(self,val_max_last_row,arg_max):
        '''Backtracking the Dynamic Programming solution (actually a Trellis diagram)
           produced by Multiprocessing Viterbi algorithm.'''
        states = np.zeros(self.T, dtype=np.int32)
        states[self.T-1] = np.argmax(val_max_last_row)
        for t in range(self.T-2, -1, -1):
            states[t] = arg_max[t+1, states[t+1]]
        return states



if __name__ == '__main__':

    obs = np.array([0,1,2])  # normal then cold and finally dizzy  

    T = 100000
    obs = np.random.binomial(2,0.3,T)        

    pi = np.array([0.6,0.4])

    A = np.array([[0.7,0.3],
                 [0.4,0.6]])

    B = np.array([[0.5,0.4,0.1],
                 [0.1,0.3,0.6]]) 

    t1 = time()
    viterbi = Viterbi(A,B,pi)
    path = viterbi.get_path(obs)
    t2 = time()
    print('Iterative Viterbi')
    print('Path: ',path)
    print('Run-time: ',round(t2-t1,6)) 
    t1 = time()
    viterbi = Viterbi(A,B,pi)
    path = viterbi.get_path_parallel(obs)
    t2 = time()
    print('\nParallel Viterbi')
    print('Path: ',path)
    print('Run-time: ',round(t2-t1,6))

Hello again Dylan. You have done a tremendous job on your code compared to your first draft. Regarding your 2nd and 3rd points, you are right. Numpy already uses processor intrinsics to make things faster and is heavily optimised. Therefore, introducing multiprocessing into the equation does not do you any good because of [fork](https://stackoverflow.com/questions/985051/what-is-the-purpose-of-fork) overhead. What I would suggest is to check out Numba and its @jit decorator. This will not teach you multiprocessing, but it will surely boost your speed. — Sıddık Açıl, Jun 24 '19 at 15:03
What I would suggest you to do is to get well-versed in multithreading/multiprocessing patterns. Look up MPI and OpenMP. Explore new stuff in C/C++. Look up SIMD and CUDA and how they are used in Numpy and CuPy. I am glad I was of help to you. — Sıddık Açıl, Jun 24 '19 at 15:07

Sıddık Açıl · Accepted Answer · 2019-06-24T14:59:37.613

Welcome to SO. Consider taking a look at the producer-consumer pattern, which is heavily used in multiprocessing.

Beware that on Windows, Python's multiprocessing starts a fresh interpreter for every process you create and re-imports your code in it. The Viterbi object each process sees, including its fields such as the Queue attribute, is therefore a separate copy and not the object in the main process.

Observe this behaviour through:

import os

def get_arg(self):
    '''Dependent Process'''
    print("Dependent ", self)
    print("Dependent ", self.output)
    print("Dependent ", os.getpid())

def get_val(self):
    '''Independent Process'''
    print("Independent ", self)
    print("Independent ", self.output)
    print("Independent ", os.getpid())

if __name__ == "__main__":
    print("Hello from main process", os.getpid())
    obs = np.array([0,1,2])  # normal then cold and finally dizzy  

    pi = np.array([0.6,0.4])

    A = np.array([[0.7,0.3],
             [0.4,0.6]])

    B = np.array([[0.5,0.4,0.1],
             [0.1,0.3,0.6]]) 

    viterbi = Viterbi(A,B,pi)
    print("Main viterbi object", viterbi)
    print("Main viterbi object queue", viterbi.output)
    path = viterbi.get_path_parallel(obs)

There are three different Viterbi objects as there are three different processes. So, what you need in terms of parallelism is not processes. You should explore the threading library that Python offers.
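For comparison, here is a minimal sketch of the same producer-consumer split written with threads, which share one set of arrays and so need no result queue. It is an illustration only, not the code from the question: `viterbi_threaded`, `produce`, `consume` and `ready` are hypothetical names. Because of the GIL and the tiny size of each numpy operation, it should not be expected to beat the serial version; it only shows that the shared-state problem disappears.

```python
import queue
import threading

import numpy as np


def viterbi_threaded(A, B, pi, x):
    """Producer-consumer Viterbi with two threads working on shared arrays."""
    T, M = len(x), A.shape[0]
    val_max = np.zeros((T, M))
    arg_max = np.zeros((T, M), dtype=np.int64)
    ready = queue.Queue()  # carries the index of each finished val_max row

    def produce():
        val_max[0] = pi * B[:, x[0]]
        ready.put(0)
        for t in range(1, T):
            val_max[t] = np.max(A * np.outer(val_max[t - 1], B[:, x[t]]), axis=0)
            ready.put(t)

    def consume():
        for t in range(1, T):
            prev = ready.get()  # blocks until row t-1 of val_max has been written
            arg_max[t] = np.argmax(val_max[prev] * A.T, axis=1)

    threads = [threading.Thread(target=produce), threading.Thread(target=consume)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    # Backtrack exactly as in the serial version.
    states = np.zeros(T, dtype=np.int64)
    states[T - 1] = np.argmax(val_max[T - 1])
    for t in range(T - 2, -1, -1):
        states[t] = arg_max[t + 1, states[t + 1]]
    return states


if __name__ == "__main__":
    pi = np.array([0.6, 0.4])
    A = np.array([[0.7, 0.3], [0.4, 0.6]])
    B = np.array([[0.5, 0.4, 0.1], [0.1, 0.3, 0.6]])
    print(viterbi_threaded(A, B, pi, np.array([0, 1, 2])))  # prints [0 0 1]
```

The queue carries only row indices. The arrays themselves are written and read in place by both threads, which is what the three separate process copies could not do.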

