Statistical Testing (College Board AP® Psychology): Revision Note
Statistical testing & statistical significance
Statistical testing
Statistical testing is used to determine whether the results of a study reflect a real effect or relationship between variables, or whether they are likely to have occurred by chance
Psychologists use two key tools to evaluate their results:
Statistical significance — determines whether a result is likely to be real rather than due to chance
Effect size — determines how meaningful or practically important that result is
Both measures must be considered together when evaluating research findings
A result can be statistically significant without being practically meaningful, and vice versa
Statistical significance
Statistical significance refers to the likelihood that the results of a study are due to the variables being studied rather than to chance
Statistical significance is expressed using a p value — the probability that the results occurred by chance:
p ≤ 0.05 — there is a 5% or less probability that the results occurred by chance
This is the standard threshold for statistical significance in psychological research
p ≤ 0.01 — there is a 1% or less probability that the results occurred by chance
This is a more stringent threshold used when greater confidence is required
If a result meets the significance threshold:
The result is described as statistically significant
The researcher rejects the null hypothesis
They can conclude that the IV did affect the DV, or that a genuine relationship exists between co-variables
If a result does not meet the significance threshold:
The result is described as not statistically significant
The researcher retains the null hypothesis
They can conclude that the findings may be due to chance rather than the variables being studied
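To make the decision rule concrete, here is a minimal sketch in Python (using SciPy), assuming an independent-samples t-test; the groups, scores, and variable names are invented for illustration only.

```python
# A minimal sketch: comparing two conditions with an independent-samples
# t-test and checking the result against the p <= 0.05 threshold.
# All data below are hypothetical.
from scipy import stats

# Hypothetical recall scores (IV: study technique; DV: words recalled)
control = [12, 15, 11, 14, 13, 12, 16, 13, 14, 12]
treatment = [16, 18, 15, 17, 19, 16, 18, 17, 15, 18]

t_stat, p_value = stats.ttest_ind(treatment, control)

ALPHA = 0.05  # the standard significance threshold
if p_value <= ALPHA:
    print(f"p = {p_value:.4f}: statistically significant -> reject the null hypothesis")
else:
    print(f"p = {p_value:.4f}: not significant -> retain the null hypothesis")
```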
When is a more stringent threshold used?
Researchers may use p ≤ 0.01 rather than p ≤ 0.05 when:
There is a significant human cost involved, e.g. drug trials where a false positive could lead to harmful treatments being approved
Existing research evidence is contradictory and greater certainty is required before drawing conclusions
Type I & Type II errors
When deciding whether to reject or retain the null hypothesis, researchers risk making one of two errors:
Type I or Type II
Type I error — false positive
A Type I error occurs when the researcher rejects the null hypothesis when it should have been retained
They conclude that a result is significant when it is actually due to chance
A Type I error is more likely when the significance threshold is set too loosely
E.g. using p ≤ 0.10 instead of p ≤ 0.05 means the researcher is more willing to accept borderline results as significant, increasing the risk of a false positive
Example:
A researcher concludes that a new memory technique significantly improves recall using p ≤ 0.10
Because the threshold is too generous, the result may actually be due to chance rather than the technique itself
Type II error — false negative
A Type II error occurs when the researcher retains the null hypothesis when it should have been rejected
They conclude that a result is not significant when a real effect actually exists
A Type II error is more likely when the significance threshold is set too stringently
E.g. using p ≤ 0.01 instead of p ≤ 0.05 means the researcher requires very strong evidence before accepting a result as significant, increasing the risk of missing a genuine effect
Example:
A researcher concludes that a new therapy has no significant effect on anxiety using p ≤ 0.01
However, the therapy may actually work; the overly stringent threshold has caused the researcher to miss a genuine finding
The relationship between Type I and Type II errors
There is an inherent trade-off between Type I and Type II errors
Setting the threshold too loosely increases the risk of a Type I error
Setting it too stringently increases the risk of a Type II error
Using the standard p ≤ 0.05 threshold balances the risk of both errors
It is neither too loose nor too stringent for most psychological research
| | Null hypothesis true | Null hypothesis false |
|---|---|---|
| Reject null hypothesis | Type I error (false positive) | Correct decision |
| Retain null hypothesis | Correct decision | Type II error (false negative) |
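The trade-off summarized in this table can be explored with a short simulation. The sketch below, assuming normally distributed scores, an illustrative sample size, and an illustrative true effect, estimates how often each error occurs at three thresholds; every number in it is invented.

```python
# A simulation sketch of the Type I / Type II trade-off, assuming
# normally distributed scores and an independent-samples t-test.
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
n, trials = 20, 2000  # illustrative group size and number of simulated studies

def share_significant(alpha, true_effect):
    """Fraction of simulated studies declared significant at this alpha."""
    hits = 0
    for _ in range(trials):
        control = rng.normal(0.0, 1.0, n)
        treatment = rng.normal(true_effect, 1.0, n)
        if stats.ttest_ind(treatment, control).pvalue <= alpha:
            hits += 1
    return hits / trials

for alpha in (0.10, 0.05, 0.01):
    type1 = share_significant(alpha, true_effect=0.0)      # null is true: any "hit" is a false positive
    type2 = 1 - share_significant(alpha, true_effect=0.5)  # a real effect exists: a "miss" is a false negative
    print(f"alpha={alpha:.2f}: Type I rate ~ {type1:.3f}, Type II rate ~ {type2:.3f}")
```

Running this shows the pattern described above: loosening the threshold raises the Type I rate, while tightening it raises the Type II rate.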
Effect size
Effect size is a numerical measure that indicates the strength of a relationship between variables in a non-experimental study, or the size of the difference between conditions in an experimental study
Effect size tells us how meaningful or practically significant a finding is, independent of whether the result is statistically significant
Effect size is expressed on a standardized scale:
Small effect size = around 0.2 or below
The relationship or difference is weak and may have limited practical significance
Medium effect size = around 0.5 (roughly 0.3–0.7)
The relationship or difference is moderate and meaningful
Large effect size = around 0.8 and above
The relationship or difference is strong and practically significant
Example:
A study finds that a new CBT program reduces depression scores with an effect size of 0.9
This is a large effect size, indicating that CBT produced a substantial and meaningful reduction in depression
A study finds that a new memory technique improves recall with an effect size of 0.15
This is a small effect size, indicating that the technique had only a minimal practical impact on memory performance
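One common measure behind a standardized scale like this is Cohen's d: the difference between two group means divided by their pooled standard deviation. A minimal sketch, with invented scores:

```python
# A minimal sketch of Cohen's d, one widely used effect-size measure.
import numpy as np

def cohens_d(group_a, group_b):
    """Cohen's d: mean difference divided by the pooled standard deviation."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    pooled_var = (((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1))
                  / (len(a) + len(b) - 2))
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# Hypothetical post-intervention scores for two groups
therapy = [24, 27, 22, 26, 25, 23, 28, 24]
control = [22, 25, 21, 24, 23, 20, 26, 22]
print(f"Cohen's d = {cohens_d(therapy, control):.2f}")  # ~0.98: large by the 0.8 benchmark
```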
Statistical significance vs. effect size
Statistical significance and effect size measure different things and must both be considered when evaluating research findings:
| | Statistical significance | Effect size |
|---|---|---|
| What it measures | Whether results are likely due to chance | How meaningful or practically important the result is |
| Expressed as | p value (e.g. p ≤ 0.05) | Numerical value (e.g. 0.2, 0.5, 0.8) |
| Can it occur without the other? | Yes — a large sample can produce statistical significance with a tiny effect size | Yes — a small sample may show a large effect size that does not reach statistical significance |
A result that is both statistically significant and has a large effect size provides the strongest possible evidence that a finding is both real and practically meaningful
A result that is statistically significant but has a small effect size may be real but of limited practical importance
This is particularly the case in studies with very large sample sizes, where even trivial differences can reach statistical significance
Example:
A study with 10,000 participants finds that listening to classical music improves exam scores by an average of 0.5 points
The result is statistically significant (p ≤ 0.05) but the effect size is tiny (0.05), meaning the finding has virtually no practical value
A study with 30 participants finds that a new anxiety intervention reduces anxiety scores by 15 points
The effect size is large (0.85) but the result does not reach statistical significance due to the small sample size
Further research with a larger sample is warranted
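A short simulation, loosely mirroring the classical-music example with invented numbers, shows how a very large sample can make a trivially small true effect statistically significant:

```python
# A sketch of why sample size matters: with a very large n, even a tiny
# true effect can reach p <= 0.05. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

for n in (30, 10_000):
    control = rng.normal(70.0, 10.0, n)  # exam scores, population mean 70
    music = rng.normal(70.5, 10.0, n)    # true advantage of only 0.5 points (d = 0.05)
    p = stats.ttest_ind(music, control).pvalue
    print(f"n = {n:>6}: p = {p:.4f}")
# Typically the n = 30 comparison is not significant, while the
# n = 10,000 comparison is, even though the true effect is trivially small.
```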
Statistical testing & the evolution of scientific conclusions
Statistical significance and effect size are central to how psychological conclusions evolve through peer review and replication:
During peer review, experts evaluate whether the statistical methods used are appropriate and whether the significance threshold applied is justified
A finding that is statistically significant with a large effect size is more likely to survive peer review and be accepted for publication
When a study is replicated and produces consistent levels of statistical significance and similar effect sizes, confidence in the finding increases
Meta-analysis synthesizes effect sizes across multiple replicated studies
This produces a pooled effect size that is more reliable and generalizable than any single study could provide
A finding that is statistically significant in one study but fails to replicate (or produces a much smaller effect size in replication) signals that the original finding may have been a Type I error
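As a rough illustration of the pooling step, the sketch below computes a fixed-effect (inverse-variance-weighted) pooled effect size from a handful of hypothetical replications; all study values are invented.

```python
# A minimal sketch of how a meta-analysis pools effect sizes: each study's
# estimate is weighted by the inverse of its variance, so larger, more
# precise studies count for more. Study values are hypothetical.
import math

# (effect size d, variance of that estimate) for several replications
studies = [(0.45, 0.040), (0.60, 0.025), (0.30, 0.050), (0.55, 0.030)]

weights = [1.0 / var for _, var in studies]
pooled_d = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1.0 / sum(weights))

print(f"Pooled effect size = {pooled_d:.2f} (SE = {pooled_se:.2f})")
```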