
I am working on classifying texts based on word occurrences. One of the steps is to estimate the probability of a particular text under each possible class. To do this, I am given NSAMPLES texts drawn from a vocabulary of NFEATURES words, each labelled with one of NLABELS class labels. From this, I construct a binary occurrence matrix where entry (sample, feature) is 1 iff text "sample" contains the word encoded by "feature".
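For concreteness, this is roughly the kind of construction I have in mind (the whitespace tokenization and the vocabulary dict below are placeholders, not my actual preprocessing):

import numpy as np

def build_occurrence_matrix(texts, vocabulary):
    # texts: list of strings; vocabulary: dict mapping word -> feature index (both illustrative)
    occurrence = np.zeros((len(texts), len(vocabulary)), dtype='int32')
    for sample, text in enumerate(texts):
        for word in set(text.lower().split()):   # naive whitespace tokenization
            feature = vocabulary.get(word)
            if feature is not None:
                occurrence[sample, feature] = 1  # 1 iff the text contains the word
    return occurrence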

From the occurrence matrix, we can construct a matrix of conditional probabilities and then smooth it so the probabilities are neither 0.0 nor 1.0, using the following code (copied from a Coursera notebook):

import numpy as np

def laplace_smoothing(labels, binary_data, n_classes):
    # Compute the parameter estimates (adjusted fraction of documents in class that contain word)
    n_words = binary_data.shape[1]
    alpha = 1 # parameter for Laplace smoothing
    theta = np.zeros([n_classes, n_words]) # stores parameter values - prob. of word given class
    for c_k in range(n_classes): # class indices 0, 1, ..., n_classes - 1
        class_mask = (labels == c_k)
        N = class_mask.sum() # number of documents in class c_k
        theta[c_k, :] = (binary_data[class_mask, :].sum(axis=0) + alpha) / (N + alpha*2)
    return theta
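To convince myself the smoothing behaves as intended, here is a tiny self-contained check (my own, not part of the notebook): 3 documents, 4 words, 2 classes.

toy_data = np.array([[1, 0, 1, 0],
                     [1, 1, 0, 0],
                     [0, 0, 0, 1]], dtype='int32')
toy_labels = np.array([0, 0, 1])
print(laplace_smoothing(toy_labels, toy_data, 2))
# Class 0 has N = 2 documents; word 0 appears in both, so theta[0, 0] = (2 + 1)/(2 + 2) = 0.75,
# while word 3, which never appears in class 0, still gets (0 + 1)/(2 + 2) = 0.25 rather than 0.0.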

To see the problem, here is code to mock up inputs and call for the result:

import tensorflow_probability as tfp
tfd = tfp.distributions

NSAMPLES = 2000   # Size of corpus
NFEATURES = 10000 # Number of words in the vocabulary
NLABELS = 10      # Number of classes
ONE_PROB = 0.02   # Probability that binary_datum will be 1

def mock_binary_data( nsamples, nfeatures, one_prob ):
    binary_data = ( np.random.uniform( 0, 1, ( nsamples, nfeatures ) ) < one_prob ).astype( 'int32' )
    return binary_data

def mock_labels( nsamples, nlabels ):
    labels = np.random.randint( 0, nlabels, nsamples )
    return labels

binary_data = mock_binary_data( NSAMPLES, NFEATURES, ONE_PROB )
labels = mock_labels( NSAMPLES, NLABELS )
smoothed_data = laplace_smoothing( labels, binary_data, NLABELS )

bernoulli = tfd.Independent( tfd.Bernoulli( probs = smoothed_data ), reinterpreted_batch_ndims = 1 )

test_random_data = mock_binary_data( 1, NFEATURES, ONE_PROB )[ 0 ]
bernoulli.prob( test_random_data )

When I execute this, I get:

<tf.Tensor: shape=(10,), dtype=float32, numpy=array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.], dtype=float32)>

that is, all the probabilities are zero. Some step here is incorrect; can you please help me find it?

  • Try reducing your `NFEATURES`; I guess the probabilities are so low that they are shown as 0. When you set `NFEATURES` to 100, it should show some probabilities, I think. – Frightera Jun 25 '21 at 20:37
  • @Frightera, thanks! I also observed that reducing NFEATURES makes the probabilities non-zero. However, there really are that many features (words in the dictionary) so I need to find some other way to resolve this. – Mark Lavin Jun 25 '21 at 21:26
  • Consider using logits instead of probs for the parameterization, or changing to a float64 dtype. – Brian Patton Jun 28 '21 at 16:07
  • Brian, thanks! I tried changing the dtype for the call to `tfd.Bernoulli` to `float64` but that had no effect; in particular, `probability_tensor` was still `float32`. It turns out there is no `dtype` arg for `tfd.Independent`. I don't understand how I would change the code to use logits; can you please explain? – Mark Lavin Jun 28 '21 at 18:13
  • What is the underlying problem for which you are writing this code? In your example you have chosen 10,000 features while the sample size is merely 2000, which seems unrealistic. If you change the number of features to a smaller number, say 20, your code fragment works perfectly. – codeR Jul 05 '21 at 11:46
  • @codeR In natural language processing that actually is pretty realistic. You have 2000 texts (samples) and your features are all the different words that appear in any of the texts. If you had only 20 features, that would mean all your 2000 texts are composed of the same 20 different words. – Nic Moetsch Jul 05 '21 at 12:04
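Following up on the comments above, here is my understanding of the log-space suggestion, written as a sketch of the last two steps (my attempt at applying the advice, not a confirmed fix). The scale involved also seems to explain why switching to float64 did not help: even if every one of the 10,000 Bernoulli factors were as large as 0.98, their product would be on the order of 1e-88, already below the smallest positive float32 (~1e-38), and the actual product for this mock data appears to be far below float64's range as well, so only working in log space avoids the underflow.

logits = np.log(smoothed_data) - np.log1p(-smoothed_data)  # logit(theta); finite thanks to the smoothing
bernoulli = tfd.Independent(tfd.Bernoulli(logits=logits),
                            reinterpreted_batch_ndims=1)
log_probs = bernoulli.log_prob(test_random_data)  # one finite log-probability per class
print(log_probs)
# For classification, comparing log-probabilities is enough, since argmax is unchanged by log:
# predicted_class = np.argmax(log_probs)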

0 Answers