In the following code, I sample a one-dimensional double-valued matrix (`mu`) from the beta distribution and then use `mu` to sample a 2-d binary-valued matrix (`Z`) from the Bernoulli distribution. However, I sometimes end up with columns of `Z` that contain only zeros. How can I efficiently write a function that discards the all-zero columns of `Z`, and the corresponding values of `mu`, without causing any conflict in the function `func` where the GSL matrices `Z` and `mu` were first defined?

Since I will need to call the function that deletes the zero-valued columns frequently, I am wondering: what is the best way to define a kind of dynamic GSL matrix that is allocated at a specific size but can be resized over and over without triggering any errors? Here is my code so far:

import numpy as np
cimport numpy as np
cimport cython
from cython.view cimport array as cvarray
from libc.stdlib cimport malloc, free
from libc.math cimport log, exp
from cython_gsl cimport *
import ctypes
from libc.string cimport memset

cdef extern from "stdlib.h":         
    long rand()                    
    void srand(int seed)                    

cdef extern from "time.h":
    ctypedef long time_t
    time_t time(time_t*)

cdef gsl_rng *r = gsl_rng_alloc(gsl_rng_mt19937)
srand(time(NULL))
@cython.cdivision(True)     
@cython.boundscheck(False)
@cython.wraparound(False)     
cdef void initPar( double* alpha, int* K, int* N, gsl_matrix* mu, gsl_matrix *Z ):
     cdef:
          int i, j
          int seed = rand()
     #generate random seed
     gsl_rng_set(r, seed)
     cdef double prev_mu
     # mu[i] is the cumulative product of i.i.d. Beta(alpha, 1) draws
     for i in range(K[0]):
         if i >= 1:
            prev_mu *= gsl_ran_beta(r, alpha[0], 1)
         else:
            prev_mu = gsl_ran_beta(r, alpha[0], 1)

         gsl_matrix_set(mu, i, 0, prev_mu)

     cdef double mi
     # column i of Z holds N Bernoulli(mu[i]) draws
     for i in range(K[0]):
         mi = gsl_matrix_get(mu, i, 0)
         for j in range(N[0]):
             gsl_matrix_set(Z, j, i, gsl_ran_bernoulli(r, mi))

     return

def func(np.ndarray[ndim=2, dtype=np.float64_t] inputs, double alpha, int LFeatures=5):
    cdef:
        int *K_init= &LFeatures
        int N    = inputs.shape[0]
        int D    = inputs.shape[1]
        gsl_matrix *mu = gsl_matrix_alloc(K_init[0], 1)
        gsl_matrix *Z  = gsl_matrix_alloc(N, K_init[0])
        int i, j
        gsl_matrix *X    = gsl_matrix_alloc(N, D)
    for i in range(N):
        for j in range(D):
            gsl_matrix_set(X, i, j, inputs[i, j])

    gsl_matrix_set_zero(mu)
    gsl_matrix_set_zero(Z)
    initPar(&alpha, K_init, &N, mu, Z )

Thanks!

Dalek
  • Is there a good reason you're doing this in GSL? Random numbers from the beta distribution are in Numpy, numbers from the Bernoulli distribution are in Scipy, and there are [well-documented ways of deleting rows/columns in Numpy](https://stackoverflow.com/questions/3877491/deleting-rows-in-numpy-array). It does look like you could do most of what you want in <10 lines of pure Python code, and it would probably run just as fast – DavidW Dec 02 '17 at 16:20
  • @DavidW Well, I want C speed for my code because it is going to be part of my resampling code (Markov Chain Monte Carlo) for a big data set with a lot of matrix operations; for instance, my matrix `Z` is on the order of 100000 rows. – Dalek Dec 02 '17 at 20:48

1 Answer

This answer isn't directly what you're after, since it's Numpy only. However, I really think using GSL is making your code much more complex for no actual gain.

You claim to be using GSL because you believe it gets "C speed" whereas Numpy doesn't - this simply isn't true. Numpy is ultimately written in C, so if you can perform operations on the whole array at once then it is very fast. Iterating over a Numpy array element by element is slower, but Cython + typed memoryviews let you do that at "C speed" too.
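To illustrate the memoryview point, here is a minimal Cython sketch of an element-by-element loop over a 2-d array that compiles down to plain C loops (the function name and task are illustrative, not from your question):

cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def count_nonzero_columns(double[:, :] Z):
    # count the columns of Z that contain at least one nonzero entry,
    # iterating through a typed memoryview at C speed
    cdef Py_ssize_t i, j
    cdef int count = 0
    for j in range(Z.shape[1]):
        for i in range(Z.shape[0]):
            if Z[i, j] != 0.0:
                count += 1
                break
    return count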

Good reasons to use GSL would be 1) interacting with an existing C library that uses GSL, or 2) GSL providing a useful function that you can't get from Numpy (in which case you could just use that one function alongside Numpy). I don't think either of these applies here.


You can easily reduce your initialization function to a few lines of pure Python+Numpy, whose work is performed in optimized C loops inside Numpy:

def initPar(alpha, N, K):
    # cumulative product of Beta(alpha, 1) draws, as in the original loop
    mu = np.cumprod(np.random.beta(alpha, 1, size=(K,)))

    # the Bernoulli distribution is the special case n=1 of the binomial;
    # broadcasting the (N,1) n-array against the (1,K) p-array gives Z
    # its (N,K) shape
    Z = np.random.binomial(np.ones((N,1), dtype=np.int32),
                           mu[np.newaxis,:])
    return mu, Z
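
A quick usage example (the parameter values are arbitrary, just to show the resulting shapes):

mu, Z = initPar(alpha=2.0, N=1000, K=5)
print(mu.shape)  # (5,)
print(Z.shape)   # (1000, 5)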

Your problem of removing the columns that are entirely zero (and the corresponding elements of mu) is again straightforward and quick in Numpy:

mask = np.any(Z, axis=0) # True wherever a column has at least one nonzero element - i.e. the columns to keep
Z = Z[:,mask]
mu = mu[mask]
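
Since you mention calling this cleanup repeatedly, you could wrap it in a small helper (the name drop_zero_columns is just illustrative). Note that Numpy sidesteps your dynamic-resizing question entirely: boolean indexing returns a new, smaller array, so there is nothing to reallocate by hand.

def drop_zero_columns(Z, mu):
    # drop the all-zero columns of Z and the matching entries of mu
    mask = np.any(Z, axis=0)
    return Z[:, mask], mu[mask]

Z, mu = drop_zero_columns(Z, mu)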
DavidW