Questions tagged [statistics]

Consider whether your question would be better asked at https://stats.stackexchange.com. Statistics is the mathematical study of using probability to infer characteristics of a population from a limited number of samples or observations.

Statistics is the scientific study of the collection, analysis, interpretation, presentation, and organization of data. Numerous programming languages provide support for implementing statistical techniques.

Consider whether your question would be better asked at CrossValidated, a Stack Exchange site for probability, statistics, data analysis, data mining, experimental design, and machine learning. StackOverflow questions on statistics should be about implementation and programming problems, not about theoretical discussions of statistics or research design. Therefore, this tag should never be used alone but always in combination with a specific programming language (like for example , , , , ).

16319 questions
778
votes
11 answers

Get statistics for each group (such as count, mean, etc) using pandas GroupBy?

I have a dataframe df and I use several columns from it to groupby: df['col1','col2','col3','col4'].groupby(['col1','col2']).mean() In the above way, I almost get the table (dataframe) that I need. What is missing is an additional column that…
Roman
  • 124,451
  • 167
  • 349
  • 456
590
votes
15 answers

Cosmic Rays: what is the probability they will affect a program?

Once again I was in a design review, and encountered the claim that the probability of a particular scenario was "less than the risk of cosmic rays" affecting the program, and it occurred to me that I didn't have the faintest idea what that…
Mark Harrison
  • 297,451
  • 125
  • 333
  • 465
492
votes
35 answers

How to find the statistical mode?

In R, mean() and median() are standard functions which do what you'd expect. mode() tells you the internal storage mode of the object, not the value that occurs the most in its argument. But is there is a standard library function that implements…
Nick
  • 21,555
  • 18
  • 47
  • 50
457
votes
11 answers

Generating statistics from Git repository

I'm looking for some good tools/scripts that allow me to generate a few statistics from a git repository. I've seen this feature on some code hosting sites, and they contained information like... commits per author commits per…
BastiBen
  • 19,679
  • 11
  • 56
  • 86
310
votes
15 answers

How to normalize a numpy array to a unit vector

I would like to convert a NumPy array to a unit vector. More specifically, I am looking for an equivalent version of this normalisation function: def normalize(v): norm = np.linalg.norm(v) if norm == 0: return v return v /…
Donbeo
  • 17,067
  • 37
  • 114
  • 188
290
votes
12 answers

How do I calculate percentiles with python/numpy?

Is there a convenient way to calculate percentiles for a sequence or single-dimensional numpy array? I am looking for something similar to Excel's percentile function. I looked in NumPy's statistics reference, and couldn't find this. All I could…
Uri
  • 88,451
  • 51
  • 221
  • 321
262
votes
50 answers

Simple way to calculate median with MySQL

What's the simplest (and hopefully not too slow) way to calculate the median with MySQL? I've used AVG(x) for finding the mean, but I'm having a hard time finding a simple way of calculating the median. For now, I'm returning all the rows to PHP,…
davr
  • 18,877
  • 17
  • 76
  • 99
257
votes
5 answers

np.mean() vs np.average() in Python NumPy?

I notice that In [30]: np.mean([1, 2, 3]) Out[30]: 2.0 In [31]: np.average([1, 2, 3]) Out[31]: 2.0 However, there should be some differences, since after all they are two different functions. What are the differences between them?
Sibbs Gambling
  • 19,274
  • 42
  • 103
  • 174
241
votes
10 answers

Find p-value (significance) in scikit-learn LinearRegression

How can I find the p-value (significance) of each coefficient? lm = sklearn.linear_model.LinearRegression() lm.fit(x,y)
elplatt
  • 3,227
  • 3
  • 18
  • 20
223
votes
18 answers

Calculating Pearson correlation and significance in Python

I am looking for a function that takes as input two lists, and returns the Pearson correlation, and the significance of the correlation.
ariel
  • 2,239
  • 2
  • 14
  • 3
192
votes
13 answers

Fitting empirical distribution to theoretical ones with Scipy (Python)?

INTRODUCTION: I have a list of more than 30,000 integer values ranging from 0 to 47, inclusive, e.g.[0,0,0,0,..,1,1,1,1,...,2,2,2,2,...,47,47,47,...] sampled from some continuous distribution. The values in the list are not necessarily in order, but…
s_sherly
  • 2,307
  • 4
  • 19
  • 14
189
votes
14 answers

Workflow for statistical analysis and report writing

Does anyone have any wisdom on workflows for data analysis related to custom report writing? The use-case is basically this: Client commissions a report that uses data analysis, e.g. a population estimate and related maps for a water district. The…
forkandwait
  • 5,041
  • 7
  • 23
  • 22
181
votes
6 answers

Compute a confidence interval from sample data

I have sample data which I would like to compute a confidence interval for, assuming a normal distribution. I have found and installed the numpy and scipy packages and have gotten numpy to return a mean and standard deviation (numpy.mean(data) with…
Bmayer0122
  • 2,138
  • 2
  • 14
  • 7
181
votes
3 answers

How to make execution pause, sleep, wait for X seconds in R?

How do you pause an R script for a specified number of seconds or miliseconds? In many languages, there is a sleep function, but ?sleep references a data set. And ?pause and ?wait don't exist. The intended purpose is for self-timed animations. The…
Dan Goldstein
  • 24,229
  • 18
  • 37
  • 41
153
votes
15 answers

Multiple linear regression in Python

I can't seem to find any python libraries that do multiple regression. The only things I find only do simple regression. I need to regress my dependent variable (y) against several independent variables (x1, x2, x3, etc.). For example, with this…
Zach
  • 4,624
  • 13
  • 43
  • 60
1
2 3
99 100