Statistical Testing (College Board AP® Psychology): Revision Note

Written by: Raj Bonsor

Reviewed by: Claire Neeson

Statistical testing & statistical significance

Statistical testing

  • Statistical testing is used to determine whether the results of a study reflect a real effect or relationship between variables, or whether they are likely to have occurred by chance

  • Psychologists use two key tools to evaluate their results:

    • Statistical significance — determines whether a result is likely to be real rather than due to chance

    • Effect size — determines how meaningful or practically important that result is

  • Both measures must be considered together when evaluating research findings

    • A result can be statistically significant without being practically meaningful, and vice versa

Statistical significance

  • Statistical significance refers to the likelihood that the results of a study are due to the variables being studied rather than to chance

  • Statistical significance is expressed using a p value — the probability that the results occurred by chance:

    • p ≤ 0.05 — there is a 5% or less probability that the results occurred by chance

      • This is the standard threshold for statistical significance in psychological research

    • p ≤ 0.01 — there is a 1% or less probability that the results occurred by chance

      • This is a more stringent threshold used when greater confidence is required

  • If a result meets the significance threshold:

    • The result is described as statistically significant

    • The researcher rejects the null hypothesis

      • They can conclude that the independent variable (IV) did affect the dependent variable (DV), or that a genuine relationship exists between co-variables

  • If a result does not meet the significance threshold:

    • The result is described as not statistically significant

    • The researcher retains the null hypothesis

      • They can conclude that the findings may be due to chance rather than the variables being studied
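
This decision rule can be sketched in Python. The recall scores below are purely illustrative, and scipy's independent-samples t-test stands in for whichever test a researcher would actually run:

```python
from scipy import stats

# Hypothetical recall scores (illustrative data, not from a real study)
treatment = [14, 16, 15, 17, 18, 16, 15, 17]   # used the memory technique
control   = [12, 13, 11, 14, 12, 13, 12, 14]   # did not

# Independent-samples t-test: how likely are these results under chance alone?
t_stat, p_value = stats.ttest_ind(treatment, control)

ALPHA = 0.05  # the standard significance threshold
if p_value <= ALPHA:
    decision = "reject the null hypothesis (statistically significant)"
else:
    decision = "retain the null hypothesis (not statistically significant)"

print(f"p = {p_value:.4f} -> {decision}")
```

With these made-up scores the p value falls well below 0.05, so the null hypothesis would be rejected.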

When is a more stringent threshold used?

  • Researchers may use p ≤ 0.01 rather than p ≤ 0.05 when:

    • there is a significant human cost involved, e.g. drug trials where a false positive could lead to harmful treatments being approved

    • existing research evidence is contradictory and greater certainty is required before drawing conclusions

Type I & Type II errors

  • When deciding whether to reject or retain the null hypothesis, researchers risk making one of two errors:

    • Type I or Type II

Type I error — false positive

  • A Type I error occurs when the researcher rejects the null hypothesis when it should have been retained

    • They conclude that a result is significant when it is actually due to chance

  • A Type I error is more likely when the significance threshold is set too loosely

    • E.g. using p ≤ 0.10 instead of p ≤ 0.05 means the researcher is more willing to accept borderline results as significant, increasing the risk of a false positive

  • Example:

    • A researcher concludes that a new memory technique significantly improves recall using p ≤ 0.10

      • Because the threshold is too generous, the result may actually be due to chance rather than the technique itself

Type II error — false negative

  • A Type II error occurs when the researcher retains the null hypothesis when it should have been rejected

    • They conclude that a result is not significant when a real effect actually exists

  • A Type II error is more likely when the significance threshold is set too stringently

    • E.g. using p ≤ 0.01 instead of p ≤ 0.05 means the researcher requires very strong evidence before accepting a result as significant, increasing the risk of missing a genuine effect

  • Example:

    • A researcher concludes that a new therapy has no significant effect on anxiety using p ≤ 0.01

      • But, the therapy may actually work; the overly stringent threshold has caused the researcher to miss a genuine finding

The relationship between Type I and Type II errors

  • There is an inherent trade-off between Type I and Type II errors

    • Setting the threshold too loosely increases the risk of a Type I error

    • Setting it too stringently increases the risk of a Type II error

  • Using the standard p ≤ 0.05 threshold balances the risk of both errors

    • It is neither too loose nor too stringent for most psychological research
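
This trade-off can be demonstrated by simulation — a sketch assuming normally distributed scores and scipy available. When the null hypothesis is true by construction, a looser threshold flags more false positives:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)  # fixed seed so the simulation is reproducible
N_EXPERIMENTS = 2000

# Simulate studies in which the null hypothesis is TRUE by construction:
# both groups are drawn from the same population (mean 100, SD 15).
p_values = np.empty(N_EXPERIMENTS)
for i in range(N_EXPERIMENTS):
    group_a = rng.normal(100, 15, size=30)
    group_b = rng.normal(100, 15, size=30)
    p_values[i] = stats.ttest_ind(group_a, group_b).pvalue

false_pos_05 = np.mean(p_values <= 0.05)  # Type I error rate at p <= 0.05
false_pos_10 = np.mean(p_values <= 0.10)  # Type I error rate at the looser p <= 0.10

print(f"False positives at p <= 0.05: {false_pos_05:.3f}")
print(f"False positives at p <= 0.10: {false_pos_10:.3f}")
```

Roughly 5% of the null experiments come out "significant" at p ≤ 0.05, and roughly 10% at p ≤ 0.10 — exactly the Type I error rates the thresholds imply.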

|  | Null hypothesis true | Null hypothesis false |
| --- | --- | --- |
| Reject null hypothesis | Type I error (false positive) | Correct decision |
| Retain null hypothesis | Correct decision | Type II error (false negative) |

Effect size

  • Effect size is a numerical measure that indicates the strength of a relationship between variables in a non-experimental study, or the size of the difference between conditions in an experimental study

    • Effect size tells us how meaningful or practically significant a finding is independently of whether the result is statistically significant

  • Effect size is expressed on a standardized scale:

    • Small effect size = 0.2 and below

      • The relationship or difference is weak and may have limited practical significance

    • Medium effect size = 0.3–0.7

      • The relationship or difference is moderate and meaningful

    • Large effect size = 0.8 and above

      • The relationship or difference is strong and practically significant

  • Example:

    • A study finds that a new CBT program reduces depression scores with an effect size of 0.9

      • This is a large effect size, indicating that CBT produced a substantial and meaningful reduction in depression

    • A study finds that a new memory technique improves recall with an effect size of 0.15

      • This is a small effect size, indicating that the technique had only a minimal practical impact on memory performance
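
One common measure behind this standardized scale is Cohen's d: the difference between two group means divided by their pooled standard deviation. A minimal sketch with hypothetical depression scores:

```python
from statistics import mean, stdev
from math import sqrt

def cohens_d(group1, group2):
    """Cohen's d: standardized difference between two group means."""
    n1, n2 = len(group1), len(group2)
    # Pooled standard deviation of the two groups
    pooled_sd = sqrt(((n1 - 1) * stdev(group1) ** 2 +
                      (n2 - 1) * stdev(group2) ** 2) / (n1 + n2 - 2))
    return (mean(group1) - mean(group2)) / pooled_sd

# Hypothetical depression scores after treatment (illustrative only)
cbt_group     = [10, 12, 9, 11, 8, 10, 9, 11]
control_group = [16, 15, 17, 14, 16, 18, 15, 16]

d = abs(cohens_d(cbt_group, control_group))
print(f"Cohen's d = {d:.2f}")  # 0.8 or above counts as a large effect
```

Because the CBT group scores several points lower on a scale with little spread, d comes out far above 0.8 — a large, practically meaningful effect.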

Statistical significance vs. effect size

  • Statistical significance and effect size measure different things and must both be considered when evaluating research findings:

|  | Statistical Significance | Effect Size |
| --- | --- | --- |
| What it measures | Whether results are likely due to chance | How meaningful or practically important the result is |
| Expressed as | p value (e.g. p ≤ 0.05) | Numerical value (e.g. 0.2, 0.5, 0.8) |
| Can occur without the other? | Yes — a large sample can produce statistical significance with a tiny effect size | Yes — a small sample may show a large effect size that does not reach statistical significance |

  • A result that is both statistically significant and has a large effect size provides the strongest possible evidence that a finding is both real and practically meaningful

  • A result that is statistically significant but has a small effect size may be real but of limited practical importance

    • This is particularly the case in studies with very large sample sizes, where even trivial differences can reach statistical significance

  • Example:

    • A study with 10,000 participants finds that listening to classical music improves exam scores by an average of 0.5 points

      • The result is statistically significant (p ≤ 0.05) but the effect size is tiny (0.05), meaning the finding has virtually no practical value

    • A study with 30 participants finds that a new anxiety intervention reduces anxiety scores by 15 points

      • The effect size is large (0.85) but the result does not reach statistical significance due to the small sample size

        • Further research with a larger sample is warranted
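
Both scenarios can be reproduced numerically with illustrative data that loosely mirrors the examples above (`cohens_d` is a helper defined here, and scipy supplies the t-test):

```python
import numpy as np
from scipy import stats

def cohens_d(a, b):
    # Standardized mean difference using the pooled SD (equal group sizes)
    pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
    return (a.mean() - b.mean()) / pooled_sd

# Very large sample (25,000 per group), trivially small difference:
# a 0.5-point gap on a scale whose SD is about 15
big_a = np.tile([85.5, 115.5], 25_000)  # mean 100.5, SD ~15
big_b = np.tile([85.0, 115.0], 25_000)  # mean 100.0, SD ~15
p_big = stats.ttest_ind(big_a, big_b).pvalue
d_big = cohens_d(big_a, big_b)

# Very small sample (6 per group), large difference in anxiety scores
small_a = np.array([120.0, 95.0, 130.0, 100.0, 125.0, 92.0])
small_b = np.array([98.0, 85.0, 110.0, 90.0, 105.0, 88.0])
p_small = stats.ttest_ind(small_a, small_b).pvalue
d_small = cohens_d(small_a, small_b)

print(f"Large n: p = {p_big:.1e}, d = {d_big:.2f} (significant, tiny effect)")
print(f"Small n: p = {p_small:.2f}, d = {d_small:.2f} (large effect, not significant)")
```

The huge sample makes a trivial 0.5-point difference statistically significant (d ≈ 0.03), while the six-per-group sample fails to reach significance despite a large effect (d ≈ 1.0).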

Statistical testing & the evolution of scientific conclusions

  • Statistical significance and effect size are central to how psychological conclusions evolve through peer review and replication:

    • During peer review, experts evaluate whether the statistical methods used are appropriate and whether the significance threshold applied is justified

    • A finding that is statistically significant with a large effect size is more likely to survive peer review and be accepted for publication

    • When a study is replicated and produces consistent levels of statistical significance and similar effect sizes, confidence in the finding increases

    • Meta-analysis synthesizes effect sizes across multiple replicated studies

      • This produces a pooled effect size that is more reliable and generalizable than any single study could provide

    • A finding that is statistically significant in one study but fails to replicate (or produces a much smaller effect size in replication) signals that the original finding may have been a Type I error
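
The pooling step can be sketched as a sample-size-weighted average of the studies' effect sizes. The numbers below are hypothetical, and real meta-analyses typically weight by inverse variance rather than raw sample size:

```python
# Hypothetical effect sizes (Cohen's d) and sample sizes from replications
studies = [
    {"d": 0.90, "n": 30},    # original small study
    {"d": 0.55, "n": 120},   # first replication
    {"d": 0.60, "n": 200},   # large replication
]

# Sample-size-weighted (fixed-effect style) pooled effect size
total_n = sum(s["n"] for s in studies)
pooled_d = sum(s["d"] * s["n"] for s in studies) / total_n

print(f"Pooled effect size: d = {pooled_d:.2f}")
```

The pooled estimate (d ≈ 0.61) sits closer to the large replications than to the original small study, because bigger samples carry more weight.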

Author: Raj Bonsor

Expertise: Psychology & Sociology Content Creator

Raj joined Save My Exams in 2024 as a Senior Content Creator for Psychology & Sociology. Prior to this, she spent fifteen years in the classroom, teaching hundreds of GCSE and A Level students. She has experience as Subject Leader for Psychology and Sociology, and her favourite topics to teach are research methods (especially inferential statistics!) and attachment. She has also successfully taught a number of Level 3 subjects, including criminology, health & social care, and citizenship.

Reviewer: Claire Neeson

Expertise: Psychology Content Creator

Claire has been teaching for 34 years, in the UK and overseas. She has taught GCSE, A-level and IB Psychology which has been a lot of fun and extremely exhausting! Claire is now a freelance Psychology teacher and content creator, producing textbooks, revision notes and (hopefully) exciting and interactive teaching materials for use in the classroom and for exam prep. Her passion (apart from Psychology of course) is roller skating and when she is not working (or watching 'Coronation Street') she can be found busting some impressive moves on her local roller rink.