Challenges associated with processing data (College Board AP® Computer Science Principles): Study Guide
Data quality & preparation
What challenges arise when processing data?
The ability to process data depends on the capabilities of the users and their tools
Datasets pose challenges regardless of size, such as:
The need to clean data: making it uniform so it can be analyzed reliably
Incomplete data: missing values or gaps in the dataset
Invalid data: values stored in the wrong format, or values that don't match the expected type (incomplete and invalid data are both illustrated in the sketch after this list)
The need to combine data sources: merging datasets that may have different formats, structures, or meanings
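The sketch below shows, in Python, one way incomplete and invalid values might be flagged before analysis; the rows, field names, and classification rules are invented for illustration:

```python
# Hypothetical survey rows; the names and values are invented
ROWS = [
    {"name": "Ada",   "age": "36"},
    {"name": "Grace", "age": ""},           # incomplete: missing value
    {"name": "Alan",  "age": "forty-one"},  # invalid: wrong format
]

def check_age(row):
    """Classify one row's age field before it is analyzed."""
    age = row["age"].strip()
    if not age:
        return "incomplete"
    if not age.isdigit():
        return "invalid"
    return "ok"

for row in ROWS:
    print(row["name"], check_age(row))
# Ada ok, Grace incomplete, Alan invalid
```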
Cleaning data
Data may not be uniform because of how it was collected. For example, users entering data into an open field may abbreviate, spell, or capitalize things differently
Cleaning data is the process of making data uniform without changing its meaning, such as replacing all equivalent abbreviations, spellings, and capitalizations with the same word
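A minimal Python sketch of this idea, assuming an invented mapping of equivalent user entries to one canonical spelling:

```python
# Invented lookup table: every equivalent abbreviation, spelling,
# and capitalization maps to the same canonical value
CANONICAL = {
    "ny": "New York",
    "n.y.": "New York",
    "nyc": "New York",
    "new york": "New York",
}

def clean_entry(raw):
    """Make one user-entered value uniform without changing its meaning."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

entries = ["NY", "new york", "N.Y.", "NYC"]
print([clean_entry(e) for e in entries])
# ['New York', 'New York', 'New York', 'New York']
```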
Bias in data
Problems of bias are often created by the type or source of data being collected
Bias is not eliminated by simply collecting more data; a biased collection method produces biased results at any scale
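A small simulation can illustrate why scale doesn't fix bias; the population, cutoff, and sample sizes here are invented. A collection method that never records low values overestimates the average no matter how much data it gathers:

```python
import random

random.seed(0)

def biased_sample(n):
    """A flawed collection method: values below 40 are never recorded,
    even though the true population is uniform over 0-100 (mean 50)."""
    results = []
    while len(results) < n:
        value = random.uniform(0, 100)
        if value > 40:  # the bias baked into the collection method
            results.append(value)
    return results

for n in (100, 100_000):
    sample = biased_sample(n)
    print(n, round(sum(sample) / n, 1))
# Both estimates land near 70, not the true mean of 50:
# collecting 1,000x more data did not reduce the bias
```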
Large data processing requirements
What makes large datasets difficult to process?
As dataset size grows, the resources required to store, process, and analyze the data increase significantly
Standard tools and hardware may lack the capabilities needed to process very large datasets in a reasonable time
Large datasets can exceed the memory or processing capacity of a single computer
Information extraction from large datasets requires efficient algorithms and powerful computing infrastructure
Resource requirements
Processing large datasets demands greater computing power, memory, and storage than smaller datasets
The time required to process data scales with dataset size: an operation that takes seconds on a small dataset may take hours on a large one (see the sketch after this list)
Organizations working with large data must invest in appropriate hardware, cloud services, or distributed computing systems to meet these requirements
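A rough Python sketch of that scaling effect, using a deliberately quadratic operation; the sizes and the operation itself are illustrative only:

```python
import time

def pairwise_duplicates(values):
    """A deliberately quadratic operation: compare every pair of values."""
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                count += 1
    return count

for n in (1_000, 10_000):
    data = list(range(n))
    start = time.perf_counter()
    pairwise_duplicates(data)
    elapsed = time.perf_counter() - start
    print(f"n = {n:>6}: {elapsed:.2f} s")
# For a quadratic operation, 10x the data takes roughly 100x the time
```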
Scalability & parallel processing
How is large data processing managed?
The size of a dataset affects the amount of information that can be extracted from it: larger datasets can reveal more patterns, but only if the processing tools can handle them
Scalability is the ability of a system to handle increasing amounts of data or workload without a significant loss of performance
A scalable system can be expanded by adding more hardware or software resources to meet growing demands
Parallel systems process large datasets by dividing the work across multiple processors or computers that operate simultaneously
Parallel processing reduces the time needed to analyze large datasets by performing many operations at once rather than one at a time (see the sketch after this list)
Tasks that would take impractically long on a single processor can be completed in a fraction of the time using parallel systems
Many modern data analysis tools and cloud platforms use parallel processing by default to handle large-scale data efficiently
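A minimal Python sketch of the divide-and-combine pattern, using the standard multiprocessing module; the chunk count and the per-chunk work are invented for illustration:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for any per-record analysis; here, a sum of squares."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(2_000_000))
    # Divide the work into four chunks handled by four processes at once
    size = len(data) // 4
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results from each worker into one answer
    print(sum(partial_results))
```

Each worker computes its partial result independently, so the total analysis time is roughly divided by the number of processors, plus the overhead of splitting the data and combining the results.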
Examiner Tips and Tricks
When the AP exam describes a situation where a dataset is too large for a single computer or takes too long to process, the solution is parallel processing or scalability. Data quality questions often describe scenarios where two datasets are combined and produce unexpected results — this usually points to the need to clean data or the need to combine data sources. Always check whether the problem is about the size of the data or the quality of the data, as these require different solutions.
For the AP Create Performance Task, if your program uses data from an external source, be prepared to explain on exam day how you ensured the data was suitable for your program's purpose — consider whether the data was complete, consistent, and free from bias
Worked Example
A research team combines data from two hospitals to analyze patient outcomes. They discover that one hospital records patient age as a whole number (e.g. 45) while the other records it as a range (e.g. 40–50).
Which of the following best describes this data challenge?
(A) A scalability issue, because the combined dataset is too large to process
(B) A bias issue, because one hospital has more patients than the other
(C) The need to clean data, because the two datasets are not uniform and must be made consistent before analysis
(D) A safety issue, because patient data should not be combined across hospitals
[1]
Answer:
(C) The need to clean data, because the two datasets are not uniform and must be made consistent before analysis [1 mark]
The two datasets record patient age in incompatible formats. This is an example of data that is not uniform, and resolving it is a data cleaning task — making the data uniform without changing its meaning.
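One hypothetical way to carry out that cleaning step in Python is to convert the first hospital's whole-number ages into the second hospital's range format; the bucket width of 10 is an assumption based on the example in the question:

```python
def age_to_range(age, width=10):
    """Map a whole-number age (e.g. 45) onto the range format
    the second hospital uses (e.g. '40-50')."""
    low = (age // width) * width
    return f"{low}-{low + width}"

print(age_to_range(45))  # '40-50'
print(age_to_range(62))  # '60-70'
```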