You may often hear about statistical hypothesis tests. They are encountered most often in science, where they serve as an instrument to express the confidence we can place in a scientific statement. So what are they about, and how do we perform one?

A statistical hypothesis test concerns the outcome of observations. More precisely, it asks whether there is no difference between specific characteristics of a population, or, if there is a slight difference, whether it can be attributed to chance alone. Sounds familiar? This kind of hypothesis is called the null hypothesis (H0). The alternative (everything that contradicts the null hypothesis) is called the alternative hypothesis, H1 or Ha.

The test is often analysed by means of a p-value: the probability of obtaining test results at least as extreme as the results observed, under the assumption that the null hypothesis is correct. For example: are boys taller than girls at age eight? The null hypothesis is that "they have the same average height."
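To make this concrete, here is a minimal sketch of how such a p-value could be obtained with a two-sample t-test in Python. The heights are simulated, and the means, spreads, and sample sizes are invented purely for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(42)

# Hypothetical samples: heights (cm) of eight-year-old boys and girls.
# All parameters below are made-up values for illustration only.
boys = rng.normal(loc=128.0, scale=5.0, size=50)
girls = rng.normal(loc=127.0, scale=5.0, size=50)

# Two-sample t-test: H0 says both groups have the same average height.
stat, p_value = ttest_ind(boys, girls)
print(f"t-statistic: {stat:.3f}, p-value: {p_value:.3f}")
```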

When analysing the p-value of a significance test, we must first establish a significance level, usually denoted by the Greek lowercase letter alpha (α). A standard value for the significance level is 5%, written as 0.05, and it formally represents the threshold for declaring a finding statistically significant.

A result is "statistically significant" if the p-value is less than or equal to the significance level. In this case, the null hypothesis is rejected. Thus:

p <= alpha: reject H0, different distribution.

p > alpha: fail to reject H0, same distribution.
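In code, this decision rule is a single comparison. The small helper below is a sketch; the p-value is assumed to come from a test such as the one above, and 0.05 is just the conventional default:

```python
# Illustrative helper for the decision rule: compare the p-value to alpha.
def interpret(p_value, alpha=0.05):
    if p_value <= alpha:
        return "Reject H0: the samples likely come from different distributions."
    return "Fail to reject H0: no evidence the distributions differ."

print(interpret(0.03))  # Reject H0 ...
print(interpret(0.20))  # Fail to reject H0 ...
```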

Be aware that the p-value is just a probability, and when dealing with probabilities, the outcome can go either way: our conclusion can be correct or incorrect. In our context, the test could be wrong, and with it our interpretation of the results.

There are two types of errors; they are:

Type I Error: Rejecting the null hypothesis when there is in fact no real effect (a false positive). The p-value is optimistically small. (The simulation sketch after this list illustrates this case.)
Type II Error: Failing to reject the null hypothesis when there is in fact a real effect (a false negative). The p-value is pessimistically large.
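A rough simulation can make the Type I error concrete: when the null hypothesis is actually true, repeated tests should still flag roughly alpha of the cases as significant purely by chance. The group sizes and distributions below are assumptions for illustration:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)
alpha, trials = 0.05, 2000

false_positives = 0
for _ in range(trials):
    # Both groups come from the same distribution, so H0 is true by construction.
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    _, p = ttest_ind(a, b)
    false_positives += (p <= alpha)

# Should come out close to alpha = 0.05.
print(f"Observed Type I error rate: {false_positives / trials:.3f}")
```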

So there must be something that helps us gauge how confident we can be that we correctly reject the null hypothesis, right? Yes, and (perhaps unexpectedly) it is again a probability that measures this confidence, formally called statistical power.

Statistical power is relevant only when the null hypothesis is false, as it is the probability that a test will correctly reject a false null hypothesis.

The higher the statistical power for a given experiment, the lower the probability of making a Type II (false negative) error, and the higher the probability of detecting an effect when there is one. The power is precisely the complement of the Type II error probability:

Power = 1 - Type II Error
Prob.(True Positive) = 1 - Prob.(False Negative)
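Power can be estimated the same way by simulation, under an assumed true effect; the effect size, sample size, and number of trials below are illustrative choices, not prescriptions:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
alpha, trials, n = 0.05, 2000, 30

rejections = 0
for _ in range(trials):
    # The groups differ by 0.5 on average, so H0 is false by construction.
    a = rng.normal(0.0, 1.0, size=n)
    b = rng.normal(0.5, 1.0, size=n)
    _, p = ttest_ind(a, b)
    rejections += (p <= alpha)

power = rejections / trials
# The Type II error rate is simply the complement of the estimated power.
print(f"Estimated power: {power:.3f}, estimated Type II error rate: {1 - power:.3f}")
```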

When interpreting statistical power, we seek experimental setups with high statistical power.

Low Statistical Power: Large risk of committing Type II errors, i.e. false negatives.
High Statistical Power: Small risk of committing Type II errors.

Experimental results obtained with too little statistical power can lead to invalid conclusions about the meaning of the results. Therefore, a minimum level of statistical power should be reached.

As a starting point, it is good practice to use reasonable defaults for some parameters, such as a significance level of 0.05 and a power of 0.80. A power analysis can then estimate the minimum sample size required for the particular experiment; a sketch is shown below. For more information, see the references.
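As a concrete sketch, such a power analysis for a two-sample t-test can be run with statsmodels; the effect size of 0.5 (a "medium" effect in Cohen's terms) is an assumption made here for illustration:

```python
from statsmodels.stats.power import TTestIndPower

# Solve for the sample size per group, given the assumed effect size,
# the conventional significance level (0.05), and the target power (0.80).
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5, alpha=0.05, power=0.80)
print(f"Minimum sample size per group: {n_per_group:.1f}")  # roughly 64
```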

References

Wikipedia
Article: A Gentle Introduction to Statistical Power and Power Analysis in Python
Book: Practical Statistics for Data Scientists