2

By C# in Visual Studio on Win 7, I need to calculate mean and standard deviation for a large series of numbers. These numbers are read from a file. There may be ten housands or even more numbers. If I do not want to save them in a data structure such as array or list, because this may not be memory efficient, how to do the calculation efficently.

I also need to print the numbers to a file.

The answers at How to calculate a standard deviation [array] cannot help.

Any help would be appreciated !

Best

Community
  • 1
  • 1
user2420472
  • 1,127
  • 1
  • 12
  • 20
  • 1
    "The answers at ... cannot help" - WHY? That could be exceptionally useful information for anyone who might try to help you. – Damien_The_Unbeliever Jan 29 '14 at 18:35
  • If the last half of the data is significantly different from the first half then the data is badly skewed and calculating mean/dev is not a meaningful operation. So reading just the first half is good enough. Apply that rule iteratively until your code is fast enough :) – Hans Passant Jan 29 '14 at 18:41
  • The classic way of computing standard deviation without keeping numbers in a list is shown in [this answer](http://stackoverflow.com/a/5336513/298225). Although the function shown uses a list, it processes each number individually once, so it can be readily adapted to processing numbers as they are read from a file, without keeping them in a list. – Eric Postpischil Jan 29 '14 at 20:49

3 Answers3

6

To calculate mean and standard deviation while processing one number at a time:

Initialize Count, Sum, and SumOfSquares to zero.

As you receive each number, add one to Count, add the number to Sum, and add the square of the number to SumOfSquares.

When you have processed each number, set Mean to Sum / Count, and set StandardDeviation to Math.Sqrt(SumOfSquares / Count - Mean * Mean).

A derivation for the latter formula is shown in the Wikipedia page for standard deviation.

Eric Postpischil
  • 195,579
  • 13
  • 168
  • 312
  • This is a much better solution than the accepted response. Specially for recalculating the deviation when adding new data. It prevents having to read through the whole data set, instead you just need to keep track of the count, sum and sum of squares. Though you might need to be careful not to overflow your sumOfSquares type. – deleb Sep 07 '22 at 03:13
4

Ten thousand numbers is nothing. A few million is enough to make you think about memory, but likely is not enough that you'll have a problem putting them all in memory.

If you get up to billions, trillions, or more, then you're to the point where you likely can't pull them into memory. It doesn't seem like you're to that point yet.

You can pull them all into a List and shouldn't need to be concerned.

Having said that, the Average method of LINQ doesn't actually need to eagerly load in all of the numbers to compute the average; it can just modify an aggregate value using the "current" item and then discard it, so its memory footprint is very low. The same can be said of all of the StdDev methods that you linked to in the question; they all have a small constant memory footprint.

So all you need to do is create an IEnumerable<double> that represents a lazily populated sequence of numbers from your file, and you can call those methods on it. There is no need to eagerly load them into memory.

Servy
  • 202,030
  • 26
  • 332
  • 449
  • would you please show me a simple example by C# ? I am new to C#. Thanks ! – user2420472 Jan 29 '14 at 20:25
  • @user2420472 The question you linked to showed you exactly how to do this, with examples, and there are *plenty* of examples of this all over the web; just do some searching. – Servy Jan 29 '14 at 20:30
  • The problem asks how to calculate the results efficiently without using an array or list. Stating that keeping a list of ten thousand numbers is “nothing” (because it does not consume much of the resources you think are available) does not answer the question because it violates both a clearly stated constraint (no array or list) and a clearly stated goal (efficiency). – Eric Postpischil Jan 29 '14 at 20:46
  • @EricPostpischil It seems you stopped reading half way and didn't look at the last two paragraphs. I *do* answer the actual question asked, after first stating that it is unnecessary. – Servy Jan 29 '14 at 20:48
  • @Servy: My comment does not say the rest of the answer is wrong. It says the remarks about keeping data in a list do not answer the question. More specifically, the first three paragraphs are inappropriate. – Eric Postpischil Jan 29 '14 at 20:50
  • @EricPostpischil You're more than welcome to write your own answer that omits such commentary if you wish. I felt that the information was helpful, and is likely to allow the author and other future readers to write a more effective solution, and that that is worth taking 15 seconds of the reader's time before getting to the answer to the question asked. You're entirely free to disagree with that opinion. – Servy Jan 29 '14 at 20:52
  • This is not a helpful answer because it first spends three paragraphs telling the OP their concerns about space are misguided, which is not a justifiable conclusion (efficiency is a reasonable goal), and then uses jargon likely unfamiliar to novices asking questions like these (“eagerly”, “IEnumberable”, “lazily populated”, et cetera) to give an indirect solution (use other software that itself requires some amount of learning and setup) when the direct mathematical solution is fairly simple (accumulate count, sum, and sum of squares, use simple calculation at end). – Eric Postpischil Jan 29 '14 at 20:59
  • @EricPostpischil As I said, you're entirely welcome to your opinion. I disagree with it. That's why we have a site, and allow multiple people to answer, because they don't all agree on how best to answer a question. – Servy Jan 29 '14 at 21:01
-1

1 million numbers (32 bit int) would take about 4mb of memory, which is small.

Just load them all into a List of int and then you can operate on them with the built in linq methods.

Update: How to do the operations with linq:

math stats with Linq

Community
  • 1
  • 1
Rick Love
  • 12,519
  • 4
  • 28
  • 27
  • The problem asks how to calculate the results efficiently without using an array or list. Stating that keeping a list a million numbers uses only a “small” amount of memory does not answer the question because it violates both a clearly stated constraint (no array or list) and a clearly stated goal (efficiency). – Eric Postpischil Jan 29 '14 at 20:47