16

So, I've posted a few times and previously my problems were pretty vague. I started C++ this week and have been doing a little project.

I'm trying to calculate standard deviation & variance. My code loads a file of 100 integers and puts them into an array, counts them, calculates the mean, sum, variance and SD. But I'm having a little trouble with the variance.

I keep getting a huge number - I have a feeling it's to do with its calculation.

My mean and sum are ok.

NB:

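// sd & mean calcs

#include <cmath>
#include <fstream>
#include <functional>
#include <iostream>
#include <iterator>
#include <numeric>
#include <sstream>
#include <string>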

using namespace std;

int main() {
    int n = 0;
    int Array[100];
    float mean;
    float var, sd;
    string line;
    float numPoints;

    ifstream myfile("numbers.txt");

    if (myfile.is_open()) {
        while (!myfile.eof()) {
            getline(myfile, line);
            
            stringstream convert(line);
        
            if (!(convert >> Array[n])) {
                Array[n] = 0;
            }

            cout << Array[n] << endl;
            n++;
        }
    
        myfile.close();
        numPoints = n;
    } else
        cout << "Error loading file" << endl;

    int sum = accumulate(begin(Array), end(Array), 0, plus<int>());
    cout << "The sum of all integers: " << sum << endl;

    mean = sum / numPoints;
    cout << "The mean of all integers: " << mean << endl;

    var = (Array[n] - mean) * (Array[n] - mean) / numPoints;
    sd = sqrt(var);
    cout << "The standard deviation is: " << sd << endl;

    return 0;
}
0009laH
Jack
  • 1
    In `(Array[n] - mean)` isn't `n` one more than the number of elements you have read? Also, [`while (!myfile.eof())` is almost always wrong](http://stackoverflow.com/questions/5605125/why-is-iostreameof-inside-a-loop-condition-considered-wrong) – Bo Persson Oct 21 '15 at 20:35
  • 2
    You should use double instead of float – FredK Oct 21 '15 at 20:47
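
To make the two comments above concrete, here is a minimal sketch of a read loop that tests the extraction itself instead of eof(), so the body runs only after a value was actually read and the count can never end up one past the data. readIntegers is a hypothetical helper name, not something from the question; it takes any std::istream, so an std::ifstream opened on numbers.txt should work the same way as a string stream.

```cpp
#include <istream>
#include <sstream>
#include <vector>

// Hypothetical helper: reads whitespace-separated integers until an
// extraction fails (end of input or a token that is not a number).
std::vector<int> readIntegers(std::istream& in) {
    std::vector<int> values;
    int value = 0;
    while (in >> value) {  // the condition is the read itself, not eof()
        values.push_back(value);
    }
    return values;
}
```

The number of values read is then values.size(), so there is no separate counter to get out of step with the data.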

8 Answers

20

Here's another approach using std::accumulate but without pow. We can use an anonymous function (a lambda) to define how each element contributes to the variance once the mean is known. Note that this computes the unbiased sample variance, so we divide by the sample size minus 1.

#include <vector>
#include <algorithm>
#include <numeric>

template<typename T>
T variance(const std::vector<T> &vec) {
    const size_t sz = vec.size();
    if (sz <= 1) {
        return 0.0;
    }

    // Calculate the mean
    const T mean = std::accumulate(vec.begin(), vec.end(), 0.0) / sz;

    // Now calculate the variance
    auto variance_func = [&mean, &sz](T accumulator, const T& val) {
        return accumulator + ((val - mean)*(val - mean) / (sz - 1));
    };

    return std::accumulate(vec.begin(), vec.end(), 0.0, variance_func);
}

A sample of how to use this function:

#include <iostream>
int main() {
    const std::vector<double> vec = {1.0, 5.0, 6.0, 3.0, 4.5};
    std::cout << variance(vec) << std::endl;
}
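
As a check on the sample above, the arithmetic can be worked by hand. This is only a sketch; the symbols n (sample size), x_i (the i-th value) and \bar{x} (the mean) are my notation, not the answer's.

```latex
\begin{align*}
\bar{x} &= \frac{1.0 + 5.0 + 6.0 + 3.0 + 4.5}{5} = \frac{19.5}{5} = 3.9 \\
s^2 &= \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
     = \frac{8.41 + 1.21 + 4.41 + 0.81 + 0.36}{4}
     = \frac{15.2}{4} = 3.8
\end{align*}
```

So the sample program should print a value of about 3.8.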
rayryeng
14

As the other answer by horseshoe correctly suggests, you have to use a loop to calculate the variance; otherwise the statement

var = ((Array[n] - mean) * (Array[n] - mean)) / numPoints;

considers only a single element of the array.

Here is a slightly improved version of horseshoe's suggested code:

var = 0;
for( n = 0; n < numPoints; n++ )
{
  var += (Array[n] - mean) * (Array[n] - mean);
}
var /= numPoints;
sd = sqrt(var);

Your sum works fine without an explicit loop because the accumulate function already loops over the range internally, even though that is not evident in the code. Take a look at the equivalent behavior of accumulate for a clear understanding of what it does.

Note: X ?= Y is short for X = X ? Y, where ? can be any binary arithmetic or bitwise operator (+, -, *, /, %, &, |, ^, <<, >>). You can also use pow(Array[n] - mean, 2) to take the square instead of multiplying the term by itself, which makes the code tidier.
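
Putting the loop, the compound assignment and the pow variant together, a minimal self-contained sketch might look like this. populationVariance is a hypothetical name, and the function assumes the caller passes the number of values actually read, not the full capacity of the array.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper: population variance (divides by count) of the
// first `count` elements of `data`.
double populationVariance(const int* data, std::size_t count) {
    assert(count > 0);  // assumption: never called with an empty range

    double sum = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        sum += data[i];  // shorthand for sum = sum + data[i]
    }
    const double mean = sum / count;

    double var = 0.0;
    for (std::size_t i = 0; i < count; ++i) {
        // std::pow(d, 2) gives the same result as d * d here
        var += std::pow(data[i] - mean, 2);
    }
    var /= count;  // shorthand for var = var / count
    return var;
}
```

The standard deviation is then std::sqrt(populationVariance(data, count)). For the values 2, 4, 4, 4, 5, 5, 7, 9 the variance works out to 4 and the standard deviation to 2.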

Ahmed Akhtar
  • 1
    thanks for the 'Note' it was useful. compare your code to horseshoe why is the for statement better than the while? or is there no real difference? – Jack Oct 22 '15 at 11:32
  • 2
    @jack technically there is no difference between the **for** and the **while** loops (except syntax), but usually when you need: (1) initialization of a variable before starting the loop, (2) an increment in the variable at the end of the loop and then (3) want to check for a condition to reiterate; then **for** makes the code much more readable and also ensures that you don't forget any of the three. – Ahmed Akhtar Oct 23 '15 at 03:56
  • Am I missing something? var /= (numPoints-1) , not / numPoints – WurmD Jun 07 '19 at 11:15
  • @WurmD Why do you think it should be divided by `numPoints - 1` and not by `numPoints`? – Ahmed Akhtar Jun 18 '19 at 13:53
  • Look at the other responses, half of them are size-1 @AhmedAkhtar – WurmD Jun 19 '19 at 21:25
  • @WurmD The `N` in the formula of variance means the number of observations, which is `numPoints` in our case, not `numPoints-1` – Ahmed Akhtar Jun 24 '19 at 11:35
  • 1
    Usually you divide by the number of points subtracted by 1 to provide an unbiased estimate of the variance. https://stats.stackexchange.com/q/100041/86678 – rayryeng Sep 27 '19 at 04:28
  • 1
    @rayryeng Thanks for the explanation to why `numPoints-1` could be used. However, I used just `numPoints` because it was in line with the formula posted by the OP. But thanks again for clarifying. – Ahmed Akhtar Sep 28 '19 at 05:13
3

Two simple methods to calculate Standard Deviation & Variance in C++.

#include <math.h>
#include <vector>

double StandardDeviation(std::vector<double>);
double Variance(std::vector<double>);

int main()
{
     std::vector<double> samples;
     samples.push_back(2.0);
     samples.push_back(3.0);
     samples.push_back(4.0);
     samples.push_back(5.0);
     samples.push_back(6.0);
     samples.push_back(7.0);

     double sd = StandardDeviation(samples);
     return 0;
}

double StandardDeviation(std::vector<double> samples)
{
     return sqrt(Variance(samples));
}

double Variance(std::vector<double> samples)
{
     int size = samples.size();

     double variance = 0;
     double t = samples[0];
     for (int i = 1; i < size; i++)
     {
          t += samples[i];
          double diff = ((i + 1) * samples[i]) - t;
          variance += (diff * diff) / ((i + 1.0) *i);
     }

     return variance / (size - 1);
}
D.Zadravec
  • Do you have a reference for that approach? Is it this one? https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Welford's_online_algorithm – Tilman Vogel Jun 16 '20 at 11:20
1

Your variance calculation is outside the loop, so it is based on a single element, Array[n], and by that point n == 100, which is past the last value read. You need an additional loop.

You need:

var = 0;
n=0;
while (n<numPoints){
   var = var + ((Array[n] - mean) * (Array[n] - mean));
   n++;
}
var /= numPoints;
sd = sqrt(var);
rayryeng
horseshoe
0

Rather than writing out more loops, you can create a function object to pass to std::accumulate to calculate the variance.

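#include <cmath>
#include <fstream>
#include <iostream>
#include <iterator>
#include <numeric>
#include <vector>

template <typename T>
struct normalize {
    T operator()(T initial, T value) const {
        return initial + std::pow(value - mean, 2);
    }
    T mean;
};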

While we are at it, we can use std::istream_iterator to do the file loading, and std::vector because we don't know how many values there are at compile time. This gives us:

int main()
{
    std::vector<int> values; // empty; filled from the file below

    std::ifstream myfile("numbers.txt");
    if (myfile)
    {
        values.assign(std::istream_iterator<int>(myfile), {});
    }
    else { std::cout << "Error loading file" << std::endl; }

    float sum = std::accumulate(values.begin(), values.end(), 0); // addition is accumulate's default operation
    std::cout << "The sum of all integers: " << sum << std::endl;
    float mean = sum / values.size();
    std::cout << "The mean of all integers: " << mean << std::endl;
    // The initial value must be a float (0.0f); with an int 0 the running total would be truncated
    float var = std::accumulate(values.begin(), values.end(), 0.0f, normalize<float>{ mean }) / values.size();
    float sd = std::sqrt(var);
    std::cout << "The standard deviation is: " << sd << std::endl;
    return 0;
}
DarenW
Caleth
0
#include <iostream>
#include <numeric>
#include <vector>
#include <cmath>
#include <utility>
#include <array>

template <class InputIterator, class T>
void Mean(InputIterator first, InputIterator last, T& mean) {
  int n = std::distance(first, last);
  mean = std::accumulate(first, last, static_cast<T>(0)) / n;
}

template <class InputIterator, class T>
void StandardDeviation(InputIterator first, InputIterator last, T& mean, T& standard_deviation) {
  int n = std::distance(first, last);
  mean = std::accumulate(first, last, static_cast<T>(0)) / n;
  // Sum of squared deviations from the mean
  T s = std::accumulate(first, last, static_cast<T>(0), [mean](double x, double y) {
    T delta = y - mean;
    return x + delta*delta;
  });
  // Population standard deviation: square root of the variance s/n
  standard_deviation = std::sqrt(s / n);
}

int main () {
  std::vector<int> v = {10, 20, 30};

  double mean = 0;
  Mean(v.begin(), v.end(), mean);
  std::cout << mean << std::endl;

  double standard_deviation = 0;
  StandardDeviation(v.begin(), v.end(), mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;

  double a[3] = {10.5, 20.5, 30.5};
  Mean(a, a+3, mean);
  std::cout << mean << std::endl;
  StandardDeviation(a, a+3, mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;

  std::array<int, 3> m = {1, 2, 3};
  Mean(m.begin(), m.end(), mean);
  std::cout << mean << std::endl;
  StandardDeviation(m.begin(), m.end(), mean, standard_deviation);
  std::cout << mean << " " << standard_deviation << std::endl;
  return 0;
}
  • While this code may provide a solution to the question, it's better to add context as to why/how it works. This can help future users learn, and apply that knowledge to their own code. You are also likely to have positive feedback from users in the form of upvotes, when the code is explained. – borchvm Aug 11 '20 at 10:53
  • Thank you, my code has a problem with performance. I fixed it. I hope it will better than. With the source code, I am still not satisfied with it. When I want compute mean and standard-deviation then the mean function is repeated 2 times. – manh duong Aug 11 '20 at 17:52
0

If you have a table of f(x) values

A basic approach using std::map.

The first entry of each map element holds the value x and the second holds f(x), the probability of that value.

Note: Don't worry about my class name (Expectation); you can use these functions in your program without it.

Find Mean

Compute the mean from this map and return it.

double Expectation::meanFinder(map<double,double> m)
{
    double sum = 0;
    for (auto it : m)
    {
        sum += it.first * it.second;
    }
    cout << "Mean: " << sum << endl;
    return sum;
}

Calculate Variance and Standard Deviation

Calculate those values and print them. (If you want, you can return them too.)

void Expectation::varianceFinder(map<double,double> m, double mean)
{
    double sum = 0;
    for (auto it : m)
    {
        double diff_square = (it.first - mean) * (it.first - mean);
        sum += diff_square * it.second;
    }
    cout << "Variance: " << sum << endl;
    cout << "Standard Deviation: " << sqrt(sum) << endl;
}

Note that this function takes the mean as a parameter. If you want, you can call meanFinder() inside this function instead.

Basic Usage

A basic usage with cin

void findVarianceTest(Expectation& expect)
{
    int size = 0;
    cout << "Enter test size:";
    cin >> size;
    map<double, double> m;   
    for (int i = 0; i < size; i++)
    {
        double freq = 0;
        double f_x = 0;
        cout << "Enter " << i+1 << ". frequency and f(X) (probability) respectively" << endl;
        cin >> freq;
        cin >> f_x;
        m.insert(pair<double,double>(freq,f_x));
    }
    expect.varianceFinder(m, expect.meanFinder(m));

}

Note that I call meanFinder() while calling varianceFinder().


Dharman
Muhammedogz
0

Assuming that the data points are in std::vector<double> data and that mean has already been computed, this is arguably slightly more efficient and readable than the accepted answer:

double var = 0;
for (double x : data)
{
    const double diff = x - mean;
    const double diff_square = std::pow(diff, 2.0);
    var += diff_square;
}
var /= data.size();
return std::sqrt(var);
Kepler