Collecting Data (Edexcel GCSE Statistics)
Revision Note
Written by: Roger B
Reviewed by: Dan Finlay
Data Collection Basics
What are different ways to collect data?
You should be familiar with different methods of collecting data
You can use direct observation to collect data
This means observing the things you are interested in and recording what you observe
For example to study pedestrians' use of mobile phones you might observe people walking past a certain spot in a town and tally the numbers who are or aren't looking at a mobile phone while they walk
You will need an appropriate data collection sheet for recording your data
This will usually be a table or tally chart, with appropriate rows or columns for the data you are collecting
For the example above you could use a tally chart, with rows for 'looking at a mobile phone' and 'not looking at a mobile phone'
An advantage of observation is that it may not affect the natural behaviour of the things you are observing
But a possible disadvantage is not having any control over the things you are studying
You can also conduct an experiment to collect data
This is done to see how changes in one variable (the explanatory or independent variable) affect another variable (the response or dependent variable)
It is important to control extraneous variables (see the 'Extraneous Variable' spec point)
Different types of experiment (laboratory experiments, field experiments, and natural experiments) have different advantages and disadvantages
including different levels of control for extraneous variables
Sometimes a pre-test will be used before starting on a full experiment
The intended experiment is run on a small sample
This may reveal any problems with the design of the experiment
And allow the problems to be fixed before the experiment is run for real
Simulation can be used to model events in the real world
Data is collected from the model to predict what would happen in the real world
This may be easier or cheaper than collecting real world data
Random processes may be involved (including the use of random numbers)
For example say 23% of the UK population possesses a certain genetic marker
A two-digit random number generator could serve as a model 'person'
A number from 00 to 22 means the 'person' has the genetic marker, and a number from 23 to 99 means they don't
You can gather data from individuals using questionnaires or interviews
These need to be used carefully to avoid bias or other possible issues
See the 'Questionnaires & Interviews' spec point
You can also use reference sources to collect secondary data
e.g., government census data, online sources, etc.
Remember that the source of secondary data needs to be acknowledged
See the 'Types of Data' revision note
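The simulation example above (a two-digit random number standing in for a 'person', with 00 to 22 meaning the person has the genetic marker) can be sketched in Python. This is only an illustrative sketch: the function name, the sample size of 1000 and the use of a seed are assumptions, not part of the specification.

```python
import random

def simulate_marker_sample(n_people, seed=None):
    """Simulate n_people 'people' using two-digit random numbers (00 to 99).

    A number from 0 to 22 means the 'person' has the marker. That is 23 of
    the 100 equally likely values, matching the 23% figure in the note.
    """
    rng = random.Random(seed)  # the seed only makes a run repeatable
    with_marker = 0
    for _ in range(n_people):
        number = rng.randint(0, 99)  # both ends included
        if number <= 22:
            with_marker += 1
    return with_marker

if __name__ == "__main__":
    n = 1000  # assumed sample size, chosen only for illustration
    count = simulate_marker_sample(n, seed=1)
    print(f"{count} of {n} simulated people have the marker "
          f"({100 * count / n:.1f}%)")
```

The percentage printed will usually be close to 23% but not exactly 23%, in the same way that a real sample would vary.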
What are the advantages and disadvantages of different kinds of experiment?
You should know the advantages and disadvantages of different types of experiment for collecting data
Laboratory experiments
Conducted in a controlled environment (it doesn't have to happen in an official laboratory!)
For example studying people's sleep patterns in a special room where lighting, temperature, bedding materials, etc. are all under the researchers' control
Advantages include
Easy to control extraneous variables
Easy to repeat the experiment under exactly the same conditions
Disadvantages include
Test subjects may not behave naturally in the controlled environment
Field experiments
Conducted in the subject's usual environment, but with the researcher controlling the situation and certain variables
For example studying people's sleep patterns in their own beds at home, but with the researchers providing specific types of pillow and deciding what time subjects should go to bed
Advantages include
More likely than a laboratory experiment to show usual or natural behaviour
Disadvantages include
Can't control all extraneous variables
Harder to repeat the experiment under exactly the same conditions
Natural experiments
Conducted in the subject's usual environment, without the researcher controlling the situation or variables
For example studying people's sleep patterns in their own beds at home, with the subjects using their own beds and bedding, going to sleep at their usual times, etc.
Advantages include
More likely than a laboratory experiment to show usual or natural behaviour
Disadvantages include
Can't control any extraneous variables
Harder to repeat the experiment under exactly the same conditions
What are validity and reliability with regard to collected data?
We say that data is reliable when repeated measurements give similar results
i.e. if you collected the data again under similar circumstances you would get similar results
For example, using a scale to weigh some samples
It should give the same result if the same sample is weighed again
The reliability of collected data is the extent to which this is true
We say that data is valid if it measures what it was intended to measure
i.e. the data should be telling you what you think it is telling you
For example, using a questionnaire to assess participants' stress levels
To be valid, scores from the questionnaire should agree with other accepted ways of measuring stress
The validity of collected data is the extent to which this is true
Reliability and validity are both very important for collected data
The more reliable and valid data is, the more we can trust any predictions or conclusions made from it
Worked Example
Tomas is a researcher studying obedience in pet dogs. He plans to study 8 different dogs. For each dog, he will first visit the dog at its home, ask it to perform 10 basic commands, and record how many the dog successfully carries out. Two days later, Tomas will visit each dog at home a second time, ask it to do the same 10 commands, and record how many the dog successfully carries out.
(a) Design a data collection sheet that Tomas could use to record the results of his experiment.
Tomas will need to record the data for the 8 different dogs
For each dog he will need to record two different data values (the number of commands successfully carried out on each visit)
The best way to do this will be in a table
Dog | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
1st visit | | | | | | | | |
2nd visit | | | | | | | | |
(b) Explain whether Tomas is conducting a laboratory experiment, a field experiment, or a natural experiment.
He is visiting the dogs at their homes, so he is not carrying out a laboratory experiment
He is controlling what the dogs are asked to do on each visit, so it is not a natural experiment
Tomas is visiting the dogs in their home environments, but he is also controlling what they are asked to do on each visit. Therefore it is a field experiment.
(c) Explain what Tomas has done to help assure the reliability of his experimental results.
Tomas is asking the dogs to perform the same 10 commands each time
He is testing them in the same setting (their home) each time
If his experiment is reliable he should get approximately the same results on both visits
He is visiting each dog twice. Both visits are in the dogs' homes, and they are asked to do the same 10 commands each time. This will test whether he gets similar results for each dog when tested in similar circumstances, and help to show whether his results are reliable.
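One way to look at reliability once the table is filled in is to compare each dog's two scores. The sketch below does this in Python. The scores are invented purely for illustration (the question gives no results), and the function name is hypothetical.

```python
# Hypothetical scores (commands carried out, out of 10) for the 8 dogs.
# These numbers are made up for illustration; the question gives no results.
first_visit = [7, 9, 4, 10, 6, 8, 5, 9]
second_visit = [8, 9, 5, 10, 6, 7, 5, 8]

def visit_differences(first, second):
    """Return the absolute change in score for each dog between the visits."""
    if len(first) != len(second):
        raise ValueError("Each dog needs a score for both visits")
    return [abs(a - b) for a, b in zip(first, second)]

differences = visit_differences(first_visit, second_visit)
for dog, diff in enumerate(differences, start=1):
    print(f"Dog {dog}: score changed by {diff}")
print("Largest change:", max(differences))
```

Small changes for every dog would suggest the results are reliable. Large changes would suggest they are not.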
Questionnaires & Interviews
What makes a good questionnaire?
A questionnaire contains a set of questions that are used to collect data
A person who completes a questionnaire is known as a respondent
You should know the difference between open and closed questions
An open question has no suggested answers, and a respondent can give any answer at all
For example, 'How do you think the current town council is doing?'
Every answer can be different
so it can be hard to summarise or analyse the data as a whole
A closed question offers the respondent a number of answers to choose from
For example, 'The current town council is doing a great job. Choose one: ☐ Agree ☐ Disagree'
It is possible to record how many people choose each response
This makes it easier to summarise and analyse the data
Closed questions will often use an opinion scale
For example offering the options 'strongly agree', 'agree', 'disagree' and 'strongly disagree'
A problem with opinion scales is that most people tend to choose responses 'in the middle', so the data collected might be biased towards those middle values
There are a number of things to consider when creating a questionnaire
Avoid leading questions
These are questions that suggest a particular answer
For example 'How delighted are you with our awesome new product?'
This is 'leading' the respondent to give a positive answer
The responses collected are likely to be biased
Make sure that options offered cover all possibilities
For example, 'How many times per day do you use our app? ☐ 1 time ☐ 2 times ☐ 3-5 times'
This doesn't offer '0' or 'more than 5' as options
You may need to include options like 'never', 'other' or 'I don't know'
Make sure any intervals given do not overlap
For example, 'How much do you spend per month on widgets? ☐ £0 to £5 ☐ £5 to £10 ☐ More than £10'
'£5' is included in the first and second options!
Be sure to be specific about time frames
For example, 'How many text messages do you send per week?' is better than 'How many text messages do you send?'
Keep questions short
and use language that is simple and easy to understand
Be careful about asking sensitive questions
i.e. questions about personal matters (age, etc.) or about things people may not want to discuss ('How many times have you stolen things from shops?')
People may not answer the questions
Or they may not answer them honestly
Sometimes a pilot survey will be used before giving the questionnaire to all the respondents in the intended survey
The questionnaire is first given to a smaller sample of people
This may reveal any problems with the design of the questionnaire
And allow the problems to be fixed before the questionnaire is used for real
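The two rules above about answer options (cover all possibilities, no overlapping intervals) can be checked mechanically. The Python sketch below is one possible way to do this. It assumes answers are whole numbers (for example whole pounds), and the function name and the use of `math.inf` for an open-ended option such as 'more than 10' are choices made for this illustration.

```python
import math

def check_options(options, lowest=0):
    """Report overlaps and gaps in whole-number answer options.

    options: list of (low, high) pairs, both ends inclusive. Use math.inf
             as `high` for an open-ended option such as 'more than 5'.
    lowest:  the smallest answer a respondent could possibly give.
    Returns a list of problems found (an empty list means none were found).
    """
    problems = []
    covered_up_to = lowest - 1  # nothing is covered yet
    for low, high in sorted(options):
        if low <= covered_up_to:
            problems.append(f"{low} appears in more than one option")
        elif low > covered_up_to + 1:
            problems.append(f"no option covers {covered_up_to + 1}")
        covered_up_to = max(covered_up_to, high)
    if covered_up_to != math.inf:
        problems.append(f"no option covers more than {covered_up_to}")
    return problems

# App question from the note: '1 time', '2 times', '3-5 times'
print(check_options([(1, 1), (2, 2), (3, 5)]))

# Widget question from the note, in whole pounds:
# '£0 to £5', '£5 to £10', 'More than £10'
print(check_options([(0, 5), (5, 10), (11, math.inf)]))
```

The first call reports that 0 and 'more than 5' are not covered. The second reports that 5 appears in more than one option.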
What are the advantages and disadvantages of interviews versus anonymous questionnaires?
In interviews an interviewer asks the questions to the respondents and records their responses
This can be done in person or by phone
Advantages of interviews:
The response rate is higher
i.e. every person interviewed will tend to answer the questions
The interviewer can explain questions (if necessary)
The respondent can explain their answers
This avoids unclear or ambiguous answers being recorded
A good interviewer can help respondents feel more comfortable when answering sensitive questions
Disadvantages of interviews:
Conducting interviews can take a lot of time
So a survey that uses interviews can take longer and be more expensive to run
The sample size will usually be smaller than when using questionnaires
This can make the sample less representative
Respondents may be less likely to be honest or to answer sensitive questions in an in-person interview
Or respondents may try to boast or to give the answers they think the interviewer wants to hear
There may be interviewer bias
This is when the opinions or expectations of the interviewer affect the answers given by the respondent
For example the interviewer may ask a question in a way that leads the respondent towards giving a particular answer
This can lead to biased results
Questionnaires will normally be given to people to fill in anonymously
This can be a printed form or a form accessible online
Advantages of questionnaires:
Respondents can answer questions in their own time
This can make the survey quicker and cheaper to run
Questionnaires can be sent to a large sample
This can make the sample more representative
Respondents may be more likely to be honest and to answer sensitive questions in an anonymous questionnaire
There is no interviewer bias
Disadvantages of questionnaires:
The response rate is lower
People may not answer all the questions, or may not complete or return the questionnaire at all
A respondent may not understand the questions
A respondent's answers may be unclear or ambiguous
What is the random response method for collecting sensitive data?
Even in an anonymous questionnaire, people may not be willing to give honest answers to sensitive questions
The random response method is a way to get better responses for these sorts of questions
It uses some sort of random event (for example a coin flip) to determine how a question will be answered
For example, say you wanted to collect data on people using handheld phones while driving
This is illegal in the UK
So people may not be willing to admit that they have done it
You could ask the question in this form:
"Have you ever driven while using a handheld phone?
Flip a coin.
If you get heads, then answer Yes.
If you get tails, then answer honestly."
There is no way to know if a person answering yes really did drive while using a handheld phone, or whether they only answered yes because they flipped the coin and got heads
To estimate the true results from a random response question:
Estimate the number of people who answered a certain way because of the random event
For example, with a coin flip about half the people will get heads and half will get tails
Remove that many responses from the data set
For the example used above, say 1000 people responded to the question
We would expect half of them to answer yes because they got heads on the coin
So remove 500 yes answers from the data set
Perform your analysis on the remaining items in the data set
See the Worked Example
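The estimation steps above can be sketched in Python, together with a small simulation that shows why removing the expected 'coin flip' answers works. The function names, the assumed true rate of 20% and the sample size are all assumptions made for this illustration. The coin is assumed to be fair.

```python
import random

def estimate_true_rate(yes, no):
    """Estimate the proportion of honest 'yes' answers (heads forces a yes)."""
    total = yes + no
    forced_yes = total / 2             # about half get heads and must say yes
    real_yes = yes - forced_yes        # yes answers left after removing those
    real_total = total - forced_yes    # answers given honestly (after tails)
    return real_yes / real_total

def simulate_random_response(n_people, true_rate, seed=None):
    """Simulate the survey: heads means 'yes', tails means answer honestly."""
    rng = random.Random(seed)
    yes = 0
    for _ in range(n_people):
        heads = rng.random() < 0.5
        truly_yes = rng.random() < true_rate
        if heads or truly_yes:
            yes += 1
    return yes, n_people - yes

if __name__ == "__main__":
    # Assumed values, for illustration only
    yes, no = simulate_random_response(1000, true_rate=0.2, seed=1)
    print(f"Yes: {yes}, No: {no}")
    print(f"Estimated true rate: {estimate_true_rate(yes, no):.2f}")
```

The estimate will usually be close to the assumed true rate, but not exactly equal to it, because the number of heads is only approximately half.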
Worked Example
A researcher is designing a questionnaire in order to collect data on how often people illegally download music.
One question the researcher is thinking of using is the following:
"A lot of people say that downloading music illegally is really okay, because it doesn't hurt anyone. How bad do you think it is to download music illegally?"
(a) State with a reason whether that is an open or a closed question.
Remember, in a closed question respondents are given a fixed set of responses to choose from
A person could give any answer at all to that question, so it is an open question
(b) State one thing that is wrong with the way the question is asked.
They start by saying that a lot of people think it's okay, before asking the actual question
This is leading a respondent towards giving a certain type of answer
It is a leading question, because it starts off by saying that a lot of people think it's okay to download music illegally
In the final version of the questionnaire, one of the questions is as follows:
"Have you ever downloaded music illegally?
Before answering the question, flip a coin.
If you get heads on the coin, then answer Yes.
If you get tails on the coin, then answer honestly."
The questionnaire is sent to a large number of people. 1332 people answer Yes to that question, and 1068 answer No.
(c) Estimate the percentage of people in the sample who have downloaded music illegally.
Start by figuring out the total number of people who responded
1332 + 1068 = 2400
The probability of getting heads on a fair coin is 1/2
Multiply that by 2400 to estimate the number of people who answered Yes because of the coin flip
1/2 × 2400 = 1200
Subtract that from 1332 to estimate the number of 'real' Yes responses
1332 - 1200 = 132
All the No responses are 'real', because no one answered No just because of the coin flip
So the total number of 'real' responses is 1068 + 132 = 1200
Divide the number of 'real' Yes responses by that to get the proportion of 'real' responses that were Yes
132 ÷ 1200 = 0.11
Multiply by 100 to convert to a percentage
0.11 × 100 = 11
11%
Data Problems & Cleaning Data
What sorts of problems can occur with collected data?
A number of problems can occur with data that has been collected
There may be missing data items
For example collecting data for the ages and weights of a number of puppies
For one puppy the researcher wrote down the weight, but forgot to record the age
There may be non-responses
This may be because someone chose not to answer a questionnaire
But it could also be because a member of a sample cannot be reached for some reason
For example questionnaires sent out to all the businesses in a government database
Some may not respond because they have gone out of business
This could mean that struggling or unsuccessful businesses are under-represented in the sample
There may be incomplete responses
People may return a questionnaire, but not answer all the questions
If lots of people don't answer the same question it may be because there's a problem with the question
Data could be in an incorrect format
For example, decimal points in the wrong place, incorrect or inconsistent units used, etc.
Data sets may contain anomalous data values (also known as outliers)
These are data values that are either very large or very small compared to the rest of the data
They may be valid data values
One very high salary in a list of company salaries may belong to the company CEO
Or they may be mistakes
A member of a sports club whose age is recorded as 500
It's possible the person is really 50, but the person recording the data put in an extra 0 by mistake
How do I clean data?
Before data can be analysed, it should first be cleaned
Incorrect data values should be identified
and corrected if possible
or otherwise removed
This includes outliers
If you decide an outlier is a mistake it should be removed
But outliers that you think are valid should be kept in the data set
You should decide what to do about missing or incomplete data
It may be possible to find missing data values
For example, ring the owner of the puppy whose weight was recorded but whose age was not, and find out the puppy's age
Incomplete data (for example from an incomplete response to a questionnaire) can be kept or removed
Consider the effect that keeping or removing the data would have on any calculated statistics or other analysis
Units or other symbols may need to be removed from the data
For example removing the 'kg' from a list of weights, or the '£' from a list of prices
This is especially important if using spreadsheets or statistical software to calculate statistics from the data
Final calculations and analysis should be done using the cleaned data set
However be sure to justify why any values have been removed from the data set
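Removing units before calculating statistics can be sketched in Python as follows. The raw values are hypothetical, and the function name is an assumption made for this illustration. Spreadsheets and statistical software have their own ways of doing the same job.

```python
def parse_measurement(raw, unit):
    """Strip a unit symbol such as 'kg' or '£' and return the number."""
    cleaned = raw.replace(unit, "").strip()
    return float(cleaned)

# Hypothetical raw entries, as they might be typed on a data collection sheet
weights = ["4.2kg", "3.8 kg", "5kg"]
prices = ["£12.50", "£3", "£7.25"]

clean_weights = [parse_measurement(w, "kg") for w in weights]
clean_prices = [parse_measurement(p, "£") for p in prices]

print(clean_weights)
print(clean_prices)
print("Mean weight:", sum(clean_weights) / len(clean_weights))
```

Once the units are removed the values are numbers rather than text, so a mean or other statistic can be calculated from them.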
Worked Example
At the start of each day, 8 towels are placed in each room in a hotel.
At the end of a particular day, the hotel manager recorded how many towels had gone missing from each room in the hotel. The results are in the table below.
2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 2 |
3.2 | 3 | 0 | 2 | 15 | 0 | 1 | 0 | 1 | 5 | 0 |
(a) Explain why the value 3.2 in the table must be an error.
You cannot have part of a towel go missing, so valid data values must be whole numbers
Only whole numbers of towels can go missing, and 3.2 is a decimal number that is not a whole number
(b) Explain why 15 is an anomalous data value, and state with a reason whether it should be kept in or removed from the data set.
15 'stands out' because it is so much bigger than all the other values in the set
And remember that only 8 towels are put in each room to begin with
15 is an anomalous data value because it is much bigger than all the other numbers in the table. It should be removed, because only 8 towels are put into each room at the start of the day, so 15 is very likely to be an error.
(c) Clean and rewrite the data.
Remove the 3.2 and the 15, and rewrite the remaining data values
This is the data set you would use to calculate any statistics or to do other analysis
2 | 0 | 0 | 1 | 0 | 0 | 1 | 1 | 0 | 0 |
2 | 3 | 0 | 2 | 0 | 1 | 0 | 1 | 5 | 0 |
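The same cleaning can be sketched in Python. The rule used here (a valid value is a whole number from 0 up to the 8 towels placed in each room) comes from parts (a) and (b). The variable and function names are assumptions made for this illustration.

```python
TOWELS_PER_ROOM = 8  # from the question: 8 towels are placed in each room

# The recorded data, copied from the table in the question
raw = [2, 0, 0, 1, 0, 0, 1, 1, 0, 0, 2,
       3.2, 3, 0, 2, 15, 0, 1, 0, 1, 5, 0]

def is_valid(value):
    """A valid count is a whole number between 0 and 8 inclusive."""
    return float(value).is_integer() and 0 <= value <= TOWELS_PER_ROOM

cleaned = [int(v) for v in raw if is_valid(v)]
removed = [v for v in raw if not is_valid(v)]

print("Cleaned data:", cleaned)
print("Removed values:", removed)
```

Listing the removed values separately makes it easier to justify why each one was taken out of the data set.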
Extraneous Variables
What are extraneous variables?
An extraneous variable is a variable that
you are not interested in
but that can affect the results of your experiment
It is important to
identify possible extraneous variables before beginning an experiment
control any extraneous variables that are identified
i.e. eliminate (or at least minimise) their effect on the data
For example an experiment looking at how a new memory technique helps people memorise lists of words
Some people are tested in a quiet room
and some people are tested in a busy café
Background noise is an extraneous variable
People in the café might do worse on the test
But only because they were distracted by the noise
This would not say anything about the memory technique being investigated
Background noise could be controlled by having everyone do the test in the same setting
How can I use control groups to control extraneous variables?
Control groups are often used when testing new treatments
People are randomly selected to be in one of two groups
People in the test group are given the treatment
People in the control group are not given the treatment
The results for the two groups are compared to see how effective the treatment is
The circumstances for the test group and control group should be as similar as possible
This is to control possible extraneous variables
For example, in medical tests the control group may be given an inactive substance (known as a placebo) that looks exactly like the active substance given to the test group
Even the people giving the substance may not know who is getting what
This makes sure everyone's experience in the experiment is as similar as possible
Matched pairs can be used in experiments with control and test groups
Each person in one group is paired with a person in the other group
People who are paired should have as much as possible in common
e.g. age, gender, educational background, annual income, geographic location, etc.
The only thing that should be different is the variable being studied
This is another way to control extraneous variables
But it can be challenging to find enough matched pairs to give a good sample size
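Randomly putting people into a test group and a control group can be sketched in Python. This is one simple approach (shuffle the list, then split it in half). The function name and the participant labels are hypothetical.

```python
import random

def assign_groups(participants, seed=None):
    """Randomly split participants into a test group and a control group."""
    rng = random.Random(seed)  # the seed only makes a run repeatable
    shuffled = list(participants)
    rng.shuffle(shuffled)      # random order, so chance decides the groups
    half = len(shuffled) // 2
    test_group = shuffled[:half]
    control_group = shuffled[half:]
    return test_group, control_group

if __name__ == "__main__":
    # Hypothetical participant labels, for illustration only
    people = ["P1", "P2", "P3", "P4", "P5", "P6", "P7", "P8"]
    test_group, control_group = assign_groups(people, seed=1)
    print("Test group:", test_group)
    print("Control group:", control_group)
```

Because the split is random, the researcher cannot (deliberately or accidentally) put particular kinds of people into one group.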
Worked Example
A doctor wants to test the effectiveness of a new medicated lotion for treating a skin condition.
She plans to select a number of people who have the condition, and to divide her test subjects into two groups. The members of one group will receive the medicated lotion to use, and the members of the other group will receive a lotion that looks and feels the same but doesn't contain any medication.
At the end of the study she will compare the two groups to see whether each subject's skin condition has improved, stayed the same, or gotten worse during the time of the study.
(a) State which of the doctor's groups is the control group, and which is the test group.
The test group is the group receiving the medicated lotion, and the control group is the group receiving the unmedicated lotion
(b) Explain how the doctor should choose which test subjects should be in which group.
Participants should be selected randomly for the groups in a control group experiment
She should use random selection
(c) Describe how the doctor could use matched pairs in her study, and explain how this could make the study results more reliable.
Matched pairs pair together people with similar characteristics
This is to control as many extraneous variables as possible
She could pair each person in the control group with a person in the test group who is of the same age and gender
This would help control extraneous variables, by making sure the only difference between people in the two groups is whether or not they use the medicated lotion
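Forming matched pairs by age and gender can be sketched in Python. This sketch takes a slightly different route from the answer above: it pairs people with the same age and gender first, then lets chance decide which member of each pair goes in the test group. The participants and the function name are hypothetical, and matching on exact age is a simplifying assumption.

```python
import random
from collections import defaultdict

def matched_pairs(people, seed=None):
    """Pair people who share the same age and gender.

    people: list of (name, age, gender) tuples.
    Returns (pairs, unmatched). Each pair is (test_person, control_person).
    """
    rng = random.Random(seed)
    by_profile = defaultdict(list)
    for name, age, gender in people:
        by_profile[(age, gender)].append(name)

    pairs, unmatched = [], []
    for names in by_profile.values():
        rng.shuffle(names)  # chance decides who gets the treatment
        while len(names) >= 2:
            pairs.append((names.pop(), names.pop()))
        unmatched.extend(names)  # anyone left over has no match
    return pairs, unmatched

if __name__ == "__main__":
    # Hypothetical participants, for illustration only
    people = [("A", 34, "F"), ("B", 34, "F"), ("C", 52, "M"),
              ("D", 52, "M"), ("E", 41, "F")]
    pairs, unmatched = matched_pairs(people, seed=1)
    print("Pairs (test, control):", pairs)
    print("Unmatched:", unmatched)
```

The unmatched list shows the practical difficulty with matched pairs: some people cannot be paired, so the usable sample gets smaller.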