Reliability (College Board AP® Psychology): Revision Note
Reliability
Reliability refers to the consistency of a measure or procedure
A study is reliable if it produces similar results when repeated under the same conditions
If a study is replicated and produces similar results, this demonstrates that the measure is consistent and not subject to significant fluctuation
There are two types of reliability:
Internal reliability — the extent to which a measure is consistent within itself
External reliability — the extent to which a measure is consistent over time and across different occasions
Reliability is essential to the scientific process in psychology
Replication is the primary means by which researchers verify that findings are consistent and not the result of chance or error
Unreliable findings cannot be confidently used to draw conclusions about psychological phenomena, and are unlikely to survive the peer review process
Reliability across research methods
Different research methods vary in their level of reliability:
Lab experiments tend to be the most reliable
They use standardized procedures, controlled conditions, and random assignment, making them easier to replicate and producing quantitative data that can be directly compared across studies
Field experiments are less reliable than lab experiments
Although they manipulate an IV and produce quantitative data, they are subject to uncontrolled extraneous variables, which makes conditions difficult to replicate exactly
Natural experiments and quasi-experiments are less reliable still
The naturally occurring IV cannot be controlled or replicated by the researcher, meaning conditions are unlikely to be identical across replications
Observational studies, surveys, and interviews vary in reliability depending on how well the procedure is standardized and how clearly variables are operationally defined
Measuring reliability
There are three main methods for measuring reliability, each suited to a different type of research:
the test-retest method
the split-half method
inter-rater reliability
Test-retest method
The test-retest method measures external reliability:
The same participants complete the same measure on two separate occasions, with a time gap between sessions (e.g. six months)
If each participant produces a similar score on both occasions, external reliability is established — the measure is consistent over time (a brief sketch of this comparison is given after this list)
Used to assess the reliability of surveys, questionnaires, and psychological scales
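A minimal sketch of the test-retest comparison, using hypothetical scale scores and Python with NumPy. The AP course does not require you to calculate reliability coefficients; this simply illustrates the common convention of correlating the two sets of scores:

```python
import numpy as np

# Hypothetical anxiety-scale scores for the same five participants,
# tested six months apart (illustrative data only)
session_1 = np.array([22, 35, 18, 41, 29])
session_2 = np.array([24, 33, 17, 43, 30])

# Pearson correlation between the two sessions:
# a strong positive value suggests the measure is stable over time
r = np.corrcoef(session_1, session_2)[0, 1]
print(f"Test-retest correlation: r = {r:.2f}")
```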
Split-half method
The split-half method measures internal reliability:
The researcher divides the measure in half and compares participants' responses to the first half with their responses to the second half
If similar responses are given across both halves, internal reliability is established — the measure is consistent within itself (see the sketch after this list)
Used to assess the internal consistency of surveys and psychological scales
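A minimal sketch of the split-half comparison, again assuming hypothetical item-level responses and NumPy. One common convention (an assumption here, not specified in this note) is to split the measure into odd-numbered and even-numbered items:

```python
import numpy as np

# Hypothetical responses: 4 participants x 10 questionnaire items,
# each item scored 1-5 (illustrative data only)
responses = np.array([
    [4, 5, 4, 4, 5, 3, 4, 5, 4, 4],
    [2, 1, 2, 3, 2, 2, 1, 2, 2, 3],
    [3, 3, 4, 3, 3, 4, 3, 3, 4, 3],
    [5, 4, 5, 5, 4, 5, 5, 4, 5, 5],
])

# Split the items into two halves (odd- vs even-numbered items)
# and total each participant's score on each half
half_a = responses[:, 0::2].sum(axis=1)
half_b = responses[:, 1::2].sum(axis=1)

# A strong positive correlation between the halves suggests
# the measure is internally consistent
r = np.corrcoef(half_a, half_b)[0, 1]
print(f"Split-half correlation: r = {r:.2f}")
```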
Inter-rater reliability
Inter-rater reliability measures the level of consistency between two or more trained observers who independently observe and record the same behavior or event
How it is established:
All observers agree on the behavioral categories and how they will be recorded before the observation begins
Each observer conducts the observation independently to avoid one influencing the other
After the observation, the two independent data sets are compared
A correlation is calculated between the two sets of scores — a strong positive correlation indicates good inter-rater reliability (see the sketch after this list)
If inter-rater reliability is low, behavioral categories are reviewed and refined before the observation is repeated
Good inter-rater reliability reduces the risk that researcher bias has distorted the findings
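A minimal sketch of the correlation step described above, assuming hypothetical tally counts from two observers and Python with NumPy:

```python
import numpy as np

# Hypothetical tallies from two observers who independently recorded
# the same playground observation across eight time intervals
observer_1 = np.array([3, 0, 2, 5, 1, 4, 2, 3])
observer_2 = np.array([3, 1, 2, 4, 1, 4, 2, 3])

# A strong positive correlation between the two independent data sets
# indicates good inter-rater reliability
r = np.corrcoef(observer_1, observer_2)[0, 1]
print(f"Inter-rater correlation: r = {r:.2f}")
```

If the correlation came out low, the behavioral categories would be reviewed and refined before repeating the observation, as described above.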
Improving reliability
If reliability is measured and found to be low, the researcher must take steps to improve it before the study is conducted or repeated
The appropriate improvement strategy depends on the research method being used
Lab and field experiments
Ensure all aspects of the procedure are fully standardized
Same instructions, same environment, same materials, same timing across all conditions
Ensure the IV and DV are clearly operationally defined so the study can be precisely replicated
Observational studies
Ensure behavioral categories are clearly operationally defined and measure only directly observable behavior
Ensure behavioral categories are mutually exclusive with no overlap or ambiguity
Use more than one observer and establish inter-rater reliability before the main observation begins
Surveys
Run the test-retest method and revise or remove any questions that produce inconsistent scores across sessions
Replace ambiguous open questions with clearly worded closed questions or Likert scale items that are less open to interpretation
Interviews
Use the same interviewer across all participants to reduce variability in delivery
Ensure interviewers are trained and follow a consistent approach
Remove leading questions, double-barreled questions, and ambiguous wording from the interview schedule
Reliability & the evolution of scientific conclusions
Reliability is fundamental to how psychological conclusions evolve through peer review and replication:
When a study is submitted for peer review, other experts in the field evaluate whether the methodology is sufficiently reliable to support the conclusions drawn
If a study cannot be replicated or produces inconsistent results, its findings will be challenged or rejected during peer review
When multiple independent replications of a study produce consistent findings, confidence in those conclusions increases — this is how psychological knowledge is built and refined over time
Unreliable findings, even if statistically significant, cannot contribute meaningfully to the scientific evidence base, because they cannot be consistently reproduced
Examiner Tips and Tricks
Ensure that you understand these key points:
Reliability and validity are not the same thing — a measure can be reliable without being valid
E.g. a bathroom scale that consistently overestimates weight by 5 pounds is reliable but not valid
A study does not need to produce identical results to be considered reliable — some variation is expected
Reliability requires that results are similar, not identical, across replications
Inter-rater reliability does not guarantee validity — two observers can consistently agree on what they are recording while still recording the wrong thing if the behavioral categories are poorly designed