Statistical Testing (College Board AP® Psychology): Revision Note

Written by: Raj Bonsor

Reviewed by: Claire Neeson

Statistical testing & statistical significance

Statistical testing

  • Statistical testing is used to determine whether the results of a study reflect a real effect or relationship between variables, or whether they are likely to have occurred by chance

  • Psychologists use two key tools to evaluate their results:

    • Statistical significance — determines whether a result is likely to be real rather than due to chance

    • Effect size — determines how meaningful or practically important that result is

  • Both measures must be considered together when evaluating research findings

    • A result can be statistically significant without being practically meaningful, and vice versa

Statistical significance

  • Statistical significance refers to the likelihood that the results of a study are due to the variables being studied rather than to chance

  • Statistical significance is expressed using a p value — the probability that the results occurred by chance:

    • p ≤ 0.05 — there is a 5% or less probability that the results occurred by chance

      • This is the standard threshold for statistical significance in psychological research

    • p ≤ 0.01 — there is a 1% or less probability that the results occurred by chance

      • This is a more stringent threshold used when greater confidence is required

  • If a result meets the significance threshold:

    • The result is described as statistically significant

    • The researcher rejects the null hypothesis

      • They can conclude that the independent variable (IV) did affect the dependent variable (DV), or that a genuine relationship exists between co-variables

  • If a result does not meet the significance threshold:

    • The result is described as not statistically significant

    • The researcher retains the null hypothesis

      • They can conclude that the findings may be due to chance rather than the variables being studied
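
This decision rule can be sketched in Python. The recall scores below are purely illustrative, and scipy's independent-samples t-test stands in for whichever test a researcher would actually run:

```python
from scipy import stats

# Hypothetical recall scores (illustrative data, not from a real study)
treatment = [14, 16, 15, 17, 18, 16, 15, 17]   # used the memory technique
control   = [12, 13, 11, 14, 12, 13, 12, 14]   # did not

# Independent-samples t-test: how likely are these results under chance alone?
t_stat, p_value = stats.ttest_ind(treatment, control)

ALPHA = 0.05  # the standard significance threshold
if p_value <= ALPHA:
    decision = "reject the null hypothesis (statistically significant)"
else:
    decision = "retain the null hypothesis (not statistically significant)"

print(f"p = {p_value:.4f} -> {decision}")
```

With these made-up scores the p value falls well below 0.05, so the null hypothesis would be rejected.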

When is a more stringent threshold used?

  • Researchers may use p ≤ 0.01 rather than p ≤ 0.05 when:

    • there is a significant human cost involved, e.g. drug trials where a false positive could lead to harmful treatments being approved

    • existing research evidence is contradictory and greater certainty is required before drawing conclusions

Type I & Type II errors

  • When deciding whether to reject or retain the null hypothesis, researchers risk making one of two errors:

    • Type I or Type II

Type I error — false positive

  • A Type I error occurs when the researcher rejects the null hypothesis when it should have been retained

    • They conclude that a result is significant when it is actually due to chance

  • A Type I error is more likely when the significance threshold is set too loosely

    • E.g. using p ≤ 0.10 instead of p ≤ 0.05 means the researcher is more willing to accept borderline results as significant, increasing the risk of a false positive

  • Example:

    • A researcher concludes that a new memory technique significantly improves recall using p ≤ 0.10

      • Because the threshold is too generous, the result may actually be due to chance rather than the technique itself

Type II error — false negative

  • A Type II error occurs when the researcher retains the null hypothesis when it should have been rejected

    • They conclude that a result is not significant when a real effect actually exists

  • A Type II error is more likely when the significance threshold is set too stringently

    • E.g. using p ≤ 0.01 instead of p ≤ 0.05 means the researcher requires very strong evidence before accepting a result as significant, increasing the risk of missing a genuine effect

  • Example:

    • A researcher concludes that a new therapy has no significant effect on anxiety using p ≤ 0.01

      • But, the therapy may actually work; the overly stringent threshold has caused the researcher to miss a genuine finding

The relationship between Type I and Type II errors

  • There is an inherent trade-off between Type I and Type II errors

    • Setting the threshold too loosely increases the risk of a Type I error

    • Setting it too stringently increases the risk of a Type II error

  • Using the standard p ≤ 0.05 threshold balances the risk of both errors

    • It is neither too loose nor too stringent for most psychological research
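
This trade-off can be demonstrated by simulation — a sketch assuming normally distributed scores and scipy available. When the null hypothesis is true by construction, a looser threshold flags more false positives:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the simulation is reproducible
N_EXPERIMENTS = 2000

# Simulate studies in which the null hypothesis is TRUE by construction:
# both groups are drawn from the same population (mean 100, SD 15).
p_values = np.empty(N_EXPERIMENTS)
for i in range(N_EXPERIMENTS):
    group_a = rng.normal(100, 15, size=30)
    group_b = rng.normal(100, 15, size=30)
    p_values[i] = stats.ttest_ind(group_a, group_b).pvalue

false_pos_05 = np.mean(p_values <= 0.05)  # Type I error rate at p <= 0.05
false_pos_10 = np.mean(p_values <= 0.10)  # Type I error rate at the looser p <= 0.10

print(f"False positives at p <= 0.05: {false_pos_05:.3f}")
print(f"False positives at p <= 0.10: {false_pos_10:.3f}")
```

Roughly 5% of the null experiments come out "significant" at p ≤ 0.05, and roughly 10% at p ≤ 0.10 — exactly the Type I error rates the thresholds imply.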

|  | Null hypothesis true | Null hypothesis false |
| --- | --- | --- |
| Reject null hypothesis | Type I error (false positive) | Correct decision |
| Retain null hypothesis | Correct decision | Type II error (false negative) |

Effect size

  • Effect size is a numerical measure that indicates the strength of a relationship between variables in a non-experimental study, or the size of the difference between conditions in an experimental study

    • Effect size tells us how meaningful or practically significant a finding is independently of whether the result is statistically significant

  • Effect size is expressed on a standardized scale:

    • Small effect size = 0.2 and below

      • The relationship or difference is weak and may have limited practical significance

    • Medium effect size = 0.3–0.7

      • The relationship or difference is moderate and meaningful

    • Large effect size = 0.8 and above

      • The relationship or difference is strong and practically significant

  • Example:

    • A study finds that a new CBT program reduces depression scores with an effect size of 0.9

      • This is a large effect size, indicating that CBT produced a substantial and meaningful reduction in depression

    • A study finds that a new memory technique improves recall with an effect size of 0.15

      • This is a small effect size, indicating that the technique had only a minimal practical impact on memory performance
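
One common measure behind this standardized scale is Cohen's d: the difference between two group means divided by their pooled standard deviation. A minimal sketch with hypothetical depression scores:

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    n1, n2 = len(group1), len(group2)
    # Pooled standard deviation of the two groups
    pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 +
                      (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled_sd

# Hypothetical depression scores after treatment (illustrative only)
cbt_group     = [10, 12, 9, 11, 8, 10, 9, 11]
control_group = [16, 15, 17, 14, 16, 18, 15, 16]

d = abs(cohens_d(cbt_group, control_group))
print(f"Cohen's d = {d:.2f}")  # 0.8 or above counts as a large effect
```

Because the CBT group scores several points lower on a scale with little spread, d comes out far above 0.8 — a large, practically meaningful effect.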

Statistical significance vs. effect size

  • Statistical significance and effect size measure different things and must both be considered when evaluating research findings:

|  | Statistical Significance | Effect Size |
| --- | --- | --- |
| What it measures | Whether results are likely due to chance | How meaningful or practically important the result is |
| Expressed as | p value (e.g. p ≤ 0.05) | Numerical value (e.g. 0.2, 0.5, 0.8) |
| Can occur without the other? | Yes — a large sample can produce statistical significance with a tiny effect size | Yes — a small sample may show a large effect size that does not reach statistical significance |

  • A result that is both statistically significant and has a large effect size provides the strongest possible evidence that a finding is both real and practically meaningful

  • A result that is statistically significant but has a small effect size may be real but of limited practical importance

    • This is particularly the case in studies with very large sample sizes, where even trivial differences can reach statistical significance

  • Example:

    • A study with 10,000 participants finds that listening to classical music improves exam scores by an average of 0.5 points

      • The result is statistically significant (p ≤ 0.05) but the effect size is tiny (0.05), meaning the finding has virtually no practical value

    • A study with 30 participants finds that a new anxiety intervention reduces anxiety scores by 15 points

      • The effect size is large (0.85) but the result does not reach statistical significance due to the small sample size

        • Further research with a larger sample is warranted
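
Both scenarios can be reproduced numerically with illustrative data that loosely mirrors the examples above (`cohens_d` is a helper defined here, and scipy supplies the t-test):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    # Standardized mean difference using the pooled SD (equal group sizes)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Very large sample (25,000 per group), trivially small difference:
# a 0.5-point gap on a scale whose SD is about 15
big_a = np.tile([85.5, 115.5], 25_000)  # mean 100.5, SD ~15
big_b = np.tile([85.0, 115.0], 25_000)  # mean 100.0, SD ~15
p_big = stats.ttest_ind(big_a, big_b).pvalue
d_big = cohens_d(big_a, big_b)

# Very small sample (6 per group), large difference in anxiety scores
small_a = np.array([120.0, 95.0, 130.0, 100.0, 125.0, 92.0])
small_b = np.array([98.0, 85.0, 110.0, 90.0, 105.0, 88.0])
p_small = stats.ttest_ind(small_a, small_b).pvalue
d_small = cohens_d(small_a, small_b)

print(f"Large n: p = {p_big:.1e}, d = {d_big:.2f} (significant, tiny effect)")
print(f"Small n: p = {p_small:.2f}, d = {d_small:.2f} (large effect, not significant)")
```

The huge sample makes a trivial 0.5-point difference statistically significant (d ≈ 0.03), while the six-per-group sample fails to reach significance despite a large effect (d ≈ 1.0).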

Statistical testing & the evolution of scientific conclusions

  • Statistical significance and effect size are central to how psychological conclusions evolve through peer review and replication:

    • During peer review, experts evaluate whether the statistical methods used are appropriate and whether the significance threshold applied is justified

    • A finding that is statistically significant with a large effect size is more likely to survive peer review and be accepted for publication

    • When a study is replicated and produces consistent levels of statistical significance and similar effect sizes, confidence in the finding increases

    • Meta-analysis synthesizes effect sizes across multiple replicated studies

      • This produces a pooled effect size that is more reliable and generalizable than any single study could provide

    • A finding that is statistically significant in one study but fails to replicate (or produces a much smaller effect size in replication) signals that the original finding may have been a Type I error
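
The pooling step can be sketched as a sample-size-weighted average of the studies' effect sizes. The numbers below are hypothetical, and real meta-analyses typically weight by inverse variance rather than raw sample size:

```python
# Hypothetical effect sizes (Cohen's d) and sample sizes from replications
studies = [
    {"d": 0.90, "n": 30},    # original small study
    {"d": 0.55, "n": 120},   # first replication
    {"d": 0.60, "n": 200},   # large replication
]

# Sample-size-weighted (fixed-effect style) pooled effect size
total_n = sum(s["n"] for s in studies)
pooled_d = sum(s["d"] * s["n"] for s in studies) / total_n

print(f"Pooled effect size: d = {pooled_d:.2f}")
```

The pooled estimate (d ≈ 0.61) sits closer to the large replications than to the original small study, because bigger samples carry more weight.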

Author: Raj Bonsor

Expertise: Psychology & Sociology Content Creator

Raj joined Save My Exams in 2024 as a Senior Content Creator for Psychology & Sociology. Prior to this, she spent fifteen years in the classroom, teaching hundreds of GCSE and A Level students. She has experience as Subject Leader for Psychology and Sociology, and her favourite topics to teach are research methods (especially inferential statistics!) and attachment. She has also successfully taught a number of Level 3 subjects, including criminology, health & social care, and citizenship.

Reviewer: Claire Neeson

Expertise: Psychology Content Creator

Claire has been teaching for 34 years, in the UK and overseas. She has taught GCSE, A-level and IB Psychology which has been a lot of fun and extremely exhausting! Claire is now a freelance Psychology teacher and content creator, producing textbooks, revision notes and (hopefully) exciting and interactive teaching materials for use in the classroom and for exam prep. Her passion (apart from Psychology of course) is roller skating and when she is not working (or watching 'Coronation Street') she can be found busting some impressive moves on her local roller rink.