
I'm having some difficulty troubleshooting code I wrote in C to perform a logistic regression. While it seems to work on smaller, semi-randomized datasets, it stops working (i.e. it stops assigning proper probabilities of belonging to class 1) at around the point where I pass 43,500 observations (determined by tweaking the number of observations created). When creating the 150 features used in the code, I do create the first two as a function of the number of observations, so I'm not sure if maybe that's the issue here, though I am using double precision. Maybe there's an overflow somewhere in the code?

The below code should be self-contained; it generates m=50,000 observations with n=150 features. Setting m below 43,500 should return "Percent class 1: 0.250000"; setting it to 44,000 or above will return "Percent class 1: 0.000000", regardless of what max_iter (the number of times we sample m observations) is set to.

The first feature is set to 1.0 divided by the total number of observations, if class 0 (first 75% of observations), or the index of the observation divided by the total number of observations otherwise.

The second feature is just index divided by total number of observations.

All other features are random.

The logistic regression is intended to use stochastic gradient descent: randomly select an observation index, compute the gradient of the loss from the predicted y under the current weights, and update the weights with that gradient scaled by the learning rate (eta), i.e. w -= eta * (y_hat - y) * x and b -= eta * (y_hat - y).

Using the same initialization with Python and NumPy, I still get the proper results, even above 50,000 observations.

#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <time.h>

// Compute z = w * x + b
double dlc( int n, double *X, double *coef, double intercept )
{
    double y_pred = intercept;
    for (int i = 0; i < n; i++)
    {
        y_pred += X[i] * coef[i];
    }
    return y_pred;
}

// Compute y_hat = 1 / (1 + e^(-z))
double sigmoid( int n, double alpha, double *X, double *coef, double beta, double intercept )
{
    double y_pred;
    y_pred = dlc(n, X, coef, intercept);
    y_pred = 1.0 / (1.0 + exp(-y_pred));

    return y_pred;
}

// Stochastic gradient descent
void sgd( int m, int n, double *X, double *y, double *coef, double *intercept, double eta, int max_iter, int fit_intercept, int random_seed )
{
    double *gradient_coef, *X_i;
    double y_i, y_pred, resid;
    int idx;

    double gradient_intercept = 0.0, alpha = 1.0, beta = 1.0;

    X_i = (double *) malloc (n * sizeof(double));
    gradient_coef = (double *) malloc (n * sizeof(double));

    for ( int i = 0; i < n; i++ )
    {
        coef[i] = 0.0;
        gradient_coef[i] = 0.0;
    }
    *intercept = 0.0;

    srand(random_seed);
    
    for ( int epoch = 0; epoch < max_iter; epoch++ )
    {
        for ( int run = 0; run < m; run++ )
        {
            // Randomly sample an observation
            idx = rand() % m;
            for ( int i = 0; i < n; i++ )
            {
                X_i[i] = X[n*idx+i];
            }
            y_i = y[idx];
            // Compute y_hat
            y_pred = sigmoid( n, alpha, X_i, coef, beta, *intercept );
            resid = -(y_i - y_pred);
            // Compute gradients and adjust weights
            for (int i = 0; i < n; i++)
            {
                gradient_coef[i] = X_i[i] * resid;
                coef[i] -= eta * gradient_coef[i];
            }
            if ( fit_intercept == 1 )
            {
                *intercept -= eta * resid;
            }
        }
    }
}

int main(void)
{
    double *X, *y, *coef, *y_pred;
    double intercept;
    double eta = 0.05;
    double alpha = 1.0, beta = 1.0;
    long m = 50000;
    long n = 150;
    int max_iter = 20;

    long class_0 = (long)(3.0 / 4.0 * (double)m);
    double pct_class_1 = 0.0;

    clock_t test_start;
    clock_t test_end;
    double test_time;

    printf("Constructing variables...\n");
    X = (double *) malloc (m * n * sizeof(double));
    y = (double *) malloc (m * sizeof(double));
    y_pred = (double *) malloc (m * sizeof(double));
    coef = (double *) malloc (n * sizeof(double));

    // Initialize classes
    for (int i = 0; i < m; i++)
    {
        if (i < class_0)
        {
            y[i] = 0.0;
        }
        else
        {
            y[i] = 1.0;
        }
    }

    // Initialize observation features
    for (int i = 0; i < m; i++)
    {
        if (i < class_0)
        {
            X[n*i] = 1.0 / (double)m;
        }
        else
        {
            X[n*i] = (double)i / (double)m;
        }
        X[n*i + 1] = (double)i / (double)m;
        for (int j = 2; j < n; j++)
        {
            X[n*i + j] = (double)(rand() % 100) / 100.0;
        }
    }

    // Fit weights
    printf("Running SGD...\n");
    test_start = clock();
    sgd( m, n, X, y, coef, &intercept, eta, max_iter, 1, 42 );
    test_end = clock();
    test_time = (double)(test_end - test_start) / CLOCKS_PER_SEC;
    printf("Time taken: %f\n", test_time);

    // Compute y_hat and share of observations predicted as class 1
    printf("Making predictions...\n");
    for ( int i = 0; i < m; i++ )
    {
        y_pred[i] = sigmoid( n, alpha, &X[i*n], coef, beta, intercept );
    }

    printf("Printing results...\n");
    for ( int i = 0; i < m; i++ )
    {
        //printf("%f\n", y_pred[i]);
        if (y_pred[i] > 0.5)
        {
            pct_class_1 += 1.0;
        }
        // Troubleshooting print
        if (i < 10 || i > m - 10)
        {
            printf("%g\n", y_pred[i]);
        }
    }
    printf("Percent class 1: %f", pct_class_1 / (double)m);

    return 0;
}

For reference, here is my (presumably) equivalent Python code, which returns the correct percent of identified classes at more than 50,000 observations:

import numpy as np
import time

def sigmoid(x):
    return 1 / (1 + np.exp(-x))


class LogisticRegressor:
    def __init__(self, eta, init_runs, fit_intercept=True):
        self.eta = eta
        self.init_runs = init_runs
        self.fit_intercept = fit_intercept
    
    def fit(self, x, y):
        m, n = x.shape
        self.coef = np.zeros((n, 1))
        self.intercept = np.zeros((1, 1))
        
        for epoch in range(self.init_runs):
            for run in range(m):
                idx = np.random.randint(0, m)
                x_i = x[idx:idx+1, :]
                y_i = y[idx]
                y_pred_i = sigmoid(x_i.dot(self.coef) + self.intercept)
                gradient_w = -(x_i.T * (y_i - y_pred_i))
                self.coef -= self.eta * gradient_w
                if self.fit_intercept:
                    gradient_b = -(y_i - y_pred_i)
                    self.intercept -= self.eta * gradient_b
        
    def predict_proba(self, x):
        m, n = x.shape
        y_pred = np.ones((m, 2))
        y_pred[:,1:2] = sigmoid(x.dot(self.coef) + self.intercept)
        y_pred[:,0:1] -= y_pred[:,1:2]
        return y_pred
    
    def predict(self, x):
        return np.round(sigmoid(x.dot(self.coef) + self.intercept))
    

m = 50000
n = 150
class1 = int(3.0 / 4.0 * m)

X = np.random.rand(m, n)
y = np.zeros((m, 1))

for obs in range(m):
    if obs < class1:
        continue
    else:
        y[obs,0] = 1

for obs in range(m):
    if obs < class1:
        X[obs, 0] = 1.0 / float(m)
    else:
        X[obs, 0] = float(obs) / float(m)
    X[obs, 1] = float(obs) / float(m)

logit = LogisticRegressor(0.05, 20)
start_time = time.time()
logit.fit(X, y)
end_time = time.time()
print(round(end_time - start_time, 2))
y_pred = logit.predict(X)
print("Percent:", y_pred.sum() / len(y_pred))
hillard28
  • What does "stops working" mean? Does it crash, does it hang, does it give garbage output? – Lundin Jan 31 '23 at 15:10
  • I tried to explain in second paragraph of my write-up, sorry if it still wasn't clear. Essentially garbage output. Whereas the code properly assigns probabilities that help identify class 1 (the last quarter of observations) below 50,000 total observations, it fails to above 50,000 and so all probabilities are near-zero. There must be an issue present for that to occur, but I'm not sure where it would be arising. – hillard28 Jan 31 '23 at 15:14
  • Not sure if relevant, but what value is `RAND_MAX` on your system? – Ian Abbott Jan 31 '23 at 15:18
  • hillard28, Debug tips 1) for FP issues: Use `"%g"` instead of `"%f"`. It is more informative, especially for small values. 2) Rather than only describe failed output like "it fails to above 50,000 and so all probabilities are near-zero.", also post the failed and expected output. – chux - Reinstate Monica Jan 31 '23 at 15:20
  • @IanAbbott that was a good point! RAND_MAX was 32,767, shouldn't have made assumptions. At any rate, I still have the issue even when using X[n*i + j] = (double)rand() / (double)RAND_MAX; – hillard28 Jan 31 '23 at 15:25
  • @chux-ReinstateMonica thanks for the tip, sorry about that. I added a quick trouble-shooting print at the end where y predictions are made. It prints the first 10 and last 10 observations with %g. Assuming at least 40 observations are used, the first 10 should be near-zero and the last 10 should be near-one. Running with m=43000 and then m=50000 shows the discrepancy. – hillard28 Jan 31 '23 at 15:29
  • I slightly modified your code so m varies from 43500 to 44000 and around m=43630 the percent class 1 result quickly drops to zero. IMO it's your algorithm itself, not some overflow. – Jabberwocky Jan 31 '23 at 15:33
  • ... m = 43630 Percent class 1: 0.23532, m = 43640 Percent class 1: 0.223075, m = 43650 Percent class 1: 0.178465, m = 43660 Percent class 1: 0.0863262, m = 43670 Percent class 1: 0.0393634, m = 43680 Percent class 1: 0.000892857, m = 43690 Percent class 1: 0, – Jabberwocky Jan 31 '23 at 15:35
  • On the other hand, I ran the program as given in the question, which has `m == 50000`, and I got "Percent class 1: 0.250000", as the OP seems to expect. Same for `m == 100000`, in fact. Certainly, then, some machine or implementation characteristic is a factor. – John Bollinger Jan 31 '23 at 15:39
  • @JohnBollinger interesting, I got the same results as the OP.... – Jabberwocky Jan 31 '23 at 15:41
  • I'm running on x86_64 Linux, with GCC 8.5.0 and glibc 2.28. – John Bollinger Jan 31 '23 at 15:42
  • @hillard28 Another good thing to report is `printf("%d\n", FLT_EVAL_METHOD);`. This helps ID what precision/range is used for intermediate FP results. For me it is 0 (not extended math) and I receive the same percent as [@John Bollinger](https://stackoverflow.com/questions/75299046/logistic-regression-code-stops-working-above-43-500-generated-observations/75299175?noredirect=1#comment132870595_75299046). I do not think this is the issue, but looking into the corners. – chux - Reinstate Monica Jan 31 '23 at 15:43
  • I'm running it on MS Visual Studio 2022. @hillard28 what is your system? – Jabberwocky Jan 31 '23 at 15:44
  • And yep, no problem with gcc. Sounds like some Microsoft bugs/weird_stuff – Jabberwocky Jan 31 '23 at 15:47
  • I can reproduce OP's result by using the example implementation of `rand()` and `srand()` from the C specification (which assumes `RAND_MAX` is 32767). – Ian Abbott Jan 31 '23 at 15:47
  • I suggest using a better PRNG function. – Ian Abbott Jan 31 '23 at 15:48
  • @chux-ReinstateMonica I get a value of 0 as well. Jabberwocky, not sure if this is necessarily what you're getting at but I've tried using gcc and clang installed via msys2/mingw64. I'm on Windows 10 64-bit. – hillard28 Jan 31 '23 at 15:48
  • @IanAbbott is this even after using the adjusted (double)rand() / (double)RAND_MAX , rather than (double)(rand() % 100) / 100.0 ? – hillard28 Jan 31 '23 at 15:53
  • @hillard28 If `m > RAND_MAX` then `idx = rand() % m;` will never generate various values. As a stop gap measure, replace all `rand()` with `(((long) rand() << 15) ^ rand())`. – chux - Reinstate Monica Jan 31 '23 at 15:59
  • Ah, yeah I just realized that I failed to adjust that as well. Using @chux-ReinstateMonica's recommendation gives me 0.250000! I'll need to test things out a bit more but thanks so much for the speedy assistance everyone! – hillard28 Jan 31 '23 at 16:04

4 Answers


The issue is here:

            // Randomly sample an observation
            idx = rand() % m;

... in light of the fact that the OP's RAND_MAX is 32767. This is exacerbated by the fact that all of the class 1 observations are at the end.

All samples will be drawn from the first 32768 observations, and when the total number of observations is greater than that, the proportion of class 1 observations among those that can be sampled is less than 0.25. At 43691 total observations there are no class 1 observations among those that can be sampled at all, since the class-0 block alone (0.75 * 43691, which truncates to 32768) fills the entire sampleable range.

As a secondary issue, rand() % m does not yield a wholly uniform distribution if m does not evenly divide RAND_MAX + 1, though the effect of this issue will be much more subtle. For example, with RAND_MAX = 32767 and m = 20000, indices 0 through 12767 each have two raw rand() values mapping to them while indices 12768 through 19999 have only one, so the low indices are drawn about twice as often.


Bottom line: you need a better random number generator.

At minimum, you could combine the bits from two calls to rand() to yield an integer with sufficient range, but you might want to consider a third-party generator. There are several available.
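
For illustration, a minimal sketch of that bit-combining approach (essentially the stop-gap suggested in the comments; it assumes RAND_MAX is 32767, i.e. 15 random bits per call, and that long is at least 32 bits wide):

#include <stdlib.h>

// Sketch: combine two 15-bit rand() results into one value with
// 30 bits of range, enough to cover 50,000 observation indices.
// Assumes RAND_MAX == 32767.
static long rand_wide(void)
{
    return ((long) rand() << 15) ^ (long) rand();
}

Usage in sgd() would then be idx = rand_wide() % m; in place of idx = rand() % m; (the modulo-bias caveat above still applies).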

John Bollinger

Note: OP reports "m=50,000 observations with n=150 features.", so perhaps this is not the issue for OP, but I'll leave this answer up for reference when OP tries larger tasks.


A potential issue:

long overflow

m * n * sizeof(double) risks overflow when long is 32-bit and m*n > LONG_MAX (or about 46,341 if m, n are the same).

OP does report (in the comments) being on 64-bit Windows with gcc/clang via msys2/mingw64, a target where long is 32 bits, so the concern is live for larger m and n.

A first step is to perform the multiplication using size_t math where we gain at least 1 more bit in the calculation.

// m * n * sizeof(double)
sizeof(double) * m * n 

Yet unless OP's size_t is wider than 32 bits, we still have trouble.

In any case, I recommend using size_t for array sizing and indexing.


Check allocations for failure too.
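
A minimal sketch of both suggestions together (the sizes match the OP's setup; the error handling is illustrative):

#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    size_t m = 50000, n = 150;

    // Putting sizeof(double) first makes the whole product evaluate
    // in size_t, gaining headroom over 32-bit long math.
    double *X = malloc(sizeof(double) * m * n);
    if (X == NULL) {
        fprintf(stderr, "allocation of %zu bytes failed\n",
                sizeof(double) * m * n);
        return EXIT_FAILURE;
    }
    free(X);
    return 0;
}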

chux - Reinstate Monica

Since RAND_MAX may be too small and array indexing should be done using size_t math, consider a helper function to generate a random index over the entire size_t range.

// idx = rand() % m;
size_t idx = rand_size_t() % (size_t)m;

If stuck with the standard rand(), below is a helper function to extend its range as needed. It uses the really nifty IMAX_BITS(m) macro.

#include <assert.h>
#include <limits.h>
#include <stdint.h>
#include <stdlib.h>

// https://stackoverflow.com/a/4589384/2410359
/* Number of bits in inttype_MAX, or in any (1<<k)-1 where 0 <= k < 2040 */
#define IMAX_BITS(m) ((m)/((m)%255+1) / 255%255*8 + 7-86/((m)%255+12))

// Test that RAND_MAX is a power of 2 minus 1
_Static_assert((RAND_MAX & 1) && ((RAND_MAX/2 + 1) & (RAND_MAX/2)) == 0, "RAND_MAX is not a Mersenne number");

#define RAND_MAX_WIDTH (IMAX_BITS(RAND_MAX))
#define SIZE_MAX_WIDTH (IMAX_BITS(SIZE_MAX))

size_t rand_size_t(void) {
  size_t index = (size_t) rand();
  for (unsigned i = RAND_MAX_WIDTH; i < SIZE_MAX_WIDTH; i += RAND_MAX_WIDTH) {
      index <<= RAND_MAX_WIDTH;
      index ^= (size_t) rand();
  }
  return index;
}

A further refinement would replace the rand_size_t() % (size_t)m with a more uniform (unbiased) mapping.
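
For illustration, a sketch of one standard unbiased mapping, rejection sampling, built on the rand_size_t() above (the helper name rand_index() is mine; it assumes m > 0 and that rand_size_t() is uniform over all size_t values):

// Reject raw draws from the top "remainder" zone so that the
// final modulo is uniform over 0 .. m-1.
size_t rand_index(size_t m) {
  size_t limit = SIZE_MAX - SIZE_MAX % m;  // a multiple of m
  size_t r;
  do {
    r = rand_size_t();
  } while (r >= limit);
  return r % m;
}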

chux - Reinstate Monica

As has been determined elsewhere, the problem is due to the implementation's RAND_MAX value being too small.

Assuming 32-bit ints, a slightly better PRNG function can be implemented in the code, such as this C implementation of the minstd_rand() function from C++:

#define MINSTD_RAND_MAX 2147483646

// Code assumes `int` is at least 32 bits wide.

static unsigned int minstd_seed = 1;

static void minstd_srand(unsigned int seed)
{
    seed %= 2147483647;
    // zero seed is bad!
    minstd_seed = seed ? seed : 1;
}

static int minstd_rand(void)
{
    minstd_seed = (unsigned long long)minstd_seed * 48271 % 2147483647;
    return (int)minstd_seed;
}

Another problem is that expressions of the form rand() % m produce a biased result when m does not divide (unsigned int)RAND_MAX + 1. Here is an unbiased function that returns a random integer from 0 to le inclusive, making use of the minstd_rand() function defined earlier:

static int minstd_rand_max(int le)
{
    int r;

    if (le < 0)
    {
        r = le;
    }
    else if (le >= MINSTD_RAND_MAX)
    {
        r = minstd_rand();
    }
    else
    {
        int rm = MINSTD_RAND_MAX - le + MINSTD_RAND_MAX % (le + 1);

        while ((r = minstd_rand()) > rm)
        {
        }
        r /= (rm / (le + 1) + 1);
    }
    return r;
}

(Actually, it does still have a very small bias because minstd_rand() will never return 0.)

For example, replace rand() % 100 with minstd_rand_max(99), and replace rand() % m with minstd_rand_max(m - 1). Also replace srand(random_seed) with minstd_srand(random_seed).
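
As a quick self-contained check (my own test harness, not part of the original answer) that the replacement generator actually reaches indices beyond 32767, one could append something like this to the code above:

#include <stdio.h>

int main(void)
{
    int m = 50000;
    int max_seen = 0;

    minstd_srand(42);
    for (int i = 0; i < 1000000; i++) {
        int idx = minstd_rand_max(m - 1);
        if (idx > max_seen)
            max_seen = idx;
    }
    // With rand() % m and RAND_MAX == 32767 the largest reachable
    // index would be 32767; here it should be near m - 1.
    printf("largest index drawn: %d\n", max_seen);
    return 0;
}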

Ian Abbott
  • @chux-ReinstateMonica Thanks. Corrected, and also stored the seed in an `unsigned int` since I'm assuming 32-bit `int`. – Ian Abbott Jan 31 '23 at 17:51
  • @chux-ReinstateMonica Just due to forgetfulness on my part. I'll change it. – Ian Abbott Jan 31 '23 at 18:01
  • @chux-ReinstateMonica Yes, I just noticed that as you were typing your comment. I must have appended the 6 instead of changing the last digit! The modulus is a Mersenne number, so the max value is one less than that. – Ian Abbott Jan 31 '23 at 18:08
  • Agree max value is one less than a [Mersenne number](https://mathworld.wolfram.com/MersenneNumber.html). I think that since it is not a Mersenne number, it will break various code bases, even though it is allowed by C. – chux - Reinstate Monica Jan 31 '23 at 18:14
  • Note that `minstd_srand(0)` leads to `minstd_rand()` always returning 0. I won't comment further other than to say it is interesting how challenging it is to make even a _simple_ PRNG. – chux - Reinstate Monica Jan 31 '23 at 18:17
  • @chux-ReinstateMonica Well, various implementations set RAND_MAX to 32767 which is composite 7*13*151. – Ian Abbott Jan 31 '23 at 18:18
  • Why do you see 32767 as a problem? Many Mersenne numbers are not primes. Are you thinking of [Mersenne primes](https://en.wikipedia.org/wiki/Mersenne_prime)? – chux - Reinstate Monica Jan 31 '23 at 18:21
  • @chux-ReinstateMonica Yes, I was thinking of Mersenne primes. I'd never heard of Mersenne numbers before. – Ian Abbott Jan 31 '23 at 18:50
  • @chux-ReinstateMonica I've dealt with the zero seed value by replacing it with 1. I do not think `minstd_seed` will ever get changed to 0 by `minstd_rand()`. – Ian Abbott Jan 31 '23 at 18:54
  • @chux-ReinstateMonica Another interesting problem is that since `minstd_rand()` will now never return 0 (but could return 1), functions that assume there are `MINSTD_RAND_MAX+1` possible return values (including my `minstd_rand_max()`) will have a tiny bias because there are only `MINSTD_RAND_MAX` possible return values. – Ian Abbott Jan 31 '23 at 19:02