Challenges associated with processing data (College Board AP® Computer Science Principles): Study Guide
Data quality & preparation
What challenges arise when processing data?
The ability to process data depends on the capabilities of the users and their tools
Datasets pose challenges regardless of size, such as:
The need to clean data: making it uniform so it can be analyzed reliably
Incomplete data: missing values or gaps in the dataset
Invalid data: values stored in the wrong format, or values that don't match the expected type (incomplete and invalid data are both illustrated in the sketch after this list)
The need to combine data sources: merging datasets that may have different formats, structures, or meanings
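The sketch below shows, in Python, one way incomplete and invalid values might be flagged before analysis; the rows, field names, and classification rules are invented for illustration:

```python
# Hypothetical survey rows; the names and values are invented
ROWS = [
    {"name": "Ada",   "age": "36"},
    {"name": "Grace", "age": ""},           # incomplete: missing value
    {"name": "Alan",  "age": "forty-one"},  # invalid: wrong format
]

def check_age(row):
    """Classify one row's age field before it is analyzed."""
    age = row["age"].strip()
    if not age:
        return "incomplete"
    if not age.isdigit():
        return "invalid"
    return "ok"

for row in ROWS:
    print(row["name"], check_age(row))
# Ada ok, Grace incomplete, Alan invalid
```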
Cleaning data
Data may not be uniform because of how it was collected. For example, users entering data into an open field may abbreviate, spell, or capitalize things differently
Cleaning data is the process of making data uniform without changing its meaning, such as replacing all equivalent abbreviations, spellings, and capitalizations with the same word
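A minimal Python sketch of this idea, assuming an invented mapping of equivalent user entries to one canonical spelling:

```python
# Invented lookup table: every equivalent abbreviation, spelling,
# and capitalization maps to the same canonical value
CANONICAL = {
    "ny": "New York",
    "n.y.": "New York",
    "nyc": "New York",
    "new york": "New York",
}

def clean_entry(raw):
    """Make one user-entered value uniform without changing its meaning."""
    key = raw.strip().lower()
    return CANONICAL.get(key, raw.strip())

entries = ["NY", "new york", "N.Y.", "NYC"]
print([clean_entry(e) for e in entries])
# ['New York', 'New York', 'New York', 'New York']
```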
Bias in data
Problems of bias are often created by the type or source of data being collected
Bias is not eliminated by simply collecting more data; a biased collection method produces biased results at any scale
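A small simulation can illustrate why scale doesn't fix bias; the population, cutoff, and sample sizes here are invented. A collection method that never records low values overestimates the average no matter how much data it gathers:

```python
import random

random.seed(0)

def biased_sample(n):
    """A flawed collection method: values below 40 are never recorded,
    even though the true population is uniform over 0-100 (mean 50)."""
    results = []
    while len(results) < n:
        value = random.uniform(0, 100)
        if value > 40:  # the bias baked into the collection method
            results.append(value)
    return results

for n in (100, 100_000):
    sample = biased_sample(n)
    print(n, round(sum(sample) / n, 1))
# Both estimates land near 70, not the true mean of 50:
# collecting 1,000x more data did not reduce the bias
```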
Large data processing requirements
What makes large datasets difficult to process?
As dataset size grows, the resources required to store, process, and analyze the data increase significantly
Standard tools and hardware may lack the capabilities needed to process very large datasets in a reasonable time
Large datasets can exceed the memory or processing capacity of a single computer
Information extraction from large datasets requires efficient algorithms and powerful computing infrastructure
Resource requirements
Processing large datasets demands greater computing power, memory, and storage than smaller datasets
The time required to process data scales with dataset size: an operation that takes seconds on a small dataset may take hours on a large one (see the sketch after this list)
Organizations working with large data must invest in appropriate hardware, cloud services, or distributed computing systems to meet these requirements
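A rough Python sketch of that scaling effect, using a deliberately quadratic operation; the sizes and the operation itself are illustrative only:

```python
import time

def pairwise_duplicates(values):
    """A deliberately quadratic operation: compare every pair of values."""
    count = 0
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            if values[i] == values[j]:
                count += 1
    return count

for n in (1_000, 10_000):
    data = list(range(n))
    start = time.perf_counter()
    pairwise_duplicates(data)
    elapsed = time.perf_counter() - start
    print(f"n = {n:>6}: {elapsed:.2f} s")
# For a quadratic operation, 10x the data takes roughly 100x the time
```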
Scalability & parallel processing
How is large data processing managed?
The size of a dataset affects the amount of information that can be extracted from it: larger datasets can reveal more patterns, but only if the processing tools can handle them
Scalability is the ability of a system to handle increasing amounts of data or workload without a significant loss of performance
A scalable system can be expanded by adding more hardware or software resources to meet growing demands
Parallel systems process large datasets by dividing the work across multiple processors or computers that operate simultaneously
Parallel processing reduces the time needed to analyze large datasets by performing many operations at once rather than one at a time (see the sketch after this list)
Tasks that would take impractically long on a single processor can be completed in a fraction of the time using parallel systems
Many modern data analysis tools and cloud platforms use parallel processing by default to handle large-scale data efficiently
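A minimal Python sketch of the divide-and-combine pattern, using the standard multiprocessing module; the chunk count and the per-chunk work are invented for illustration:

```python
from multiprocessing import Pool

def process_chunk(chunk):
    """Stand-in for any per-record analysis; here, a sum of squares."""
    return sum(x * x for x in chunk)

if __name__ == "__main__":
    data = list(range(2_000_000))
    # Divide the work into four chunks handled by four processes at once
    size = len(data) // 4
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with Pool(processes=4) as pool:
        partial_results = pool.map(process_chunk, chunks)
    # Combine the partial results from each worker into one answer
    print(sum(partial_results))
```

Each worker computes its partial result independently, so the total analysis time is roughly divided by the number of processors, plus the overhead of splitting the data and combining the results.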
Examiner Tips and Tricks
When the AP exam describes a situation where a dataset is too large for a single computer or takes too long to process, the solution is parallel processing or scalability. Data quality questions often describe scenarios where two datasets are combined and produce unexpected results — this usually points to the need to clean data or the need to combine data sources. Always check whether the problem is about the size of the data or the quality of the data, as these require different solutions.
For the AP Create Performance Task, if your program uses data from an external source, be prepared to explain on exam day how you ensured the data was suitable for your program's purpose — consider whether the data was complete, consistent, and free from bias
Worked Example
A research team combines data from two hospitals to analyze patient outcomes. They discover that one hospital records patient age as a whole number (e.g. 45) while the other records it as a range (e.g. 40–50).
Which of the following best describes this data challenge?
(A) A scalability issue, because the combined dataset is too large to process
(B) A bias issue, because one hospital has more patients than the other
(C) The need to clean data, because the two datasets are not uniform and must be made consistent before analysis
(D) A safety issue, because patient data should not be combined across hospitals
[1]
Answer:
(C) The need to clean data, because the two datasets are not uniform and must be made consistent before analysis [1 mark]
The two datasets record patient age in incompatible formats. This is an example of data that is not uniform, and resolving it is a data cleaning task — making the data uniform without changing its meaning.
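One hypothetical way to carry out that cleaning step in Python is to convert the first hospital's whole-number ages into the second hospital's range format; the bucket width of 10 is an assumption based on the example in the question:

```python
def age_to_range(age, width=10):
    """Map a whole-number age (e.g. 45) onto the range format
    the second hospital uses (e.g. '40-50')."""
    low = (age // width) * width
    return f"{low}-{low + width}"

print(age_to_range(45))  # '40-50'
print(age_to_range(62))  # '60-70'
```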