Subject 2. Data Preparation and Wrangling
Data preparation and wrangling involve cleansing and organizing data into a consolidated format.
Structured Data
For structured data, data preparation and wrangling entail data cleansing and data preprocessing.
Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate data. This process is crucial and emphasized because wrong data can drive a business to wrong decisions, conclusions, and poor analysis.
Typical errors are:
- Incompleteness: required data are missing.
- Invalidity: data fall outside a required range, e.g., numbers or dates that must lie within certain bounds.
- Inaccuracy: the data are not close to the true values.
- Inconsistency: e.g., a valid age of 10 combined with a 'divorced' marital status.
- Non-uniformity: e.g., the currency is sometimes in USD and sometimes in yen.
- Duplication: duplicate entries should be removed.
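The checks above can be sketched in code. This is a minimal illustration, not a production cleansing routine; the field names, valid ranges, and example records are assumptions made for the example.

```python
# Flag typical data errors in a list of records (illustrative fields/ranges).
records = [
    {"age": 34, "marital_status": "married", "salary_usd": 72000},
    {"age": None, "marital_status": "single", "salary_usd": 55000},   # incomplete
    {"age": 10, "marital_status": "divorced", "salary_usd": 0},       # inconsistent
    {"age": 34, "marital_status": "married", "salary_usd": 72000},    # duplicate
]

def find_errors(rows):
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        if any(v is None for v in row.values()):
            errors.append((i, "incompleteness"))
        if row["age"] is not None and not (0 <= row["age"] <= 120):
            errors.append((i, "invalidity"))
        if row["age"] is not None and row["age"] < 16 and row["marital_status"] == "divorced":
            errors.append((i, "inconsistency"))
        key = tuple(sorted(row.items()))
        if key in seen:
            errors.append((i, "duplication"))
        seen.add(key)
    return errors

find_errors(records)
# → [(1, 'incompleteness'), (2, 'inconsistency'), (3, 'duplication')]
```

Each flagged record would then be corrected, imputed, or removed before preprocessing.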
Data preprocessing typically involves transformations and scaling of the (cleansed) data.
- Extraction: build derived values intended to be informative and non-redundant.
- Aggregation: combine two or more attributes into a single attribute, e.g., cities aggregated into regions, states, or countries.
- Filtration: identify and remove noise in the dataset.
- Selection: select a subset of relevant attributes for use in the model.
- Conversion: convert attributes into different data types.
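A few of these transformations can be shown on a single cleansed record. The city-to-region mapping and field names are illustrative assumptions, not a standard scheme.

```python
# Preprocessing sketch: extraction, aggregation, and conversion.
REGION = {"Boston": "Northeast", "Austin": "South", "Seattle": "West"}  # assumed mapping

record = {"city": "Austin", "age": "41", "joined": "2023-06-01"}

processed = {
    "region": REGION[record["city"]],        # aggregation: city -> region
    "age": int(record["age"]),               # conversion: string -> integer
    "join_year": int(record["joined"][:4]),  # extraction: derive year from date
}
# processed → {'region': 'South', 'age': 41, 'join_year': 2023}
```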
Scaling: adjusting the range of a feature by shifting and changing the scale of data. Outliers should be removed before performing scaling. Two common ways of scaling are normalization and standardization.
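The two common scaling methods can be sketched directly from their definitions: normalization rescales values to the [0, 1] interval, and standardization centers them to zero mean and unit standard deviation.

```python
def normalize(xs):
    # Min-max normalization: rescale each value to [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Standardization: subtract the mean, divide by the standard deviation.
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
normalize(values)  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because both methods use the minimum, maximum, mean, or standard deviation, a single extreme outlier can distort the scaled values, which is why outliers are removed first.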
Unstructured Data
Unstructured data must be transformed into structured data to be processed by computers. Text data is used in the reading to illustrate the cleansing and preprocessing of unstructured data.
Text cleansing removes unwanted elements from the raw text. It typically involves removing HTML tags, punctuation, most numbers, and extra white space.
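These cleansing steps can be sketched with regular expressions; the order of the substitutions here is one reasonable choice, not a prescribed sequence.

```python
import re

def cleanse(text):
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML tags
    text = re.sub(r"[0-9]+", " ", text)      # remove most numbers
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation
    text = re.sub(r"\s+", " ", text).strip() # collapse extra white space
    return text

cleanse("<p>Revenue rose 12% in Q4!</p>")
# → 'Revenue rose in Q'
```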
Text wrangling (preprocessing) can be essential in making sure you have the best data to work with. It requires performing normalization and involves the following:
- Removing stop words such as "the" and "a" because they occur so frequently.
- Stemming: cutting a token down to its root stem.
- Lemmatization: takes context and part of speech into account to determine the lemma, the root form of the word.
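Stop-word removal and stemming can be illustrated with a toy sketch. The stop-word set is a small assumed sample, and the suffix-stripping "stemmer" is a crude stand-in for a real algorithm such as the Porter stemmer found in NLP libraries.

```python
# Illustrative normalization: stop-word removal plus toy suffix stripping.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}  # assumed sample list

def stem(token):
    # Crude stem: strip common suffixes (a stand-in for a real stemmer).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize_tokens(tokens):
    return [stem(t) for t in tokens if t not in STOP_WORDS]

normalize_tokens(["the", "analysts", "reported", "rising", "profits"])
# → ['analyst', 'report', 'ris', 'profit']
```

Note that a stem such as 'ris' need not be a dictionary word; lemmatization would instead map "rising" to the lemma "rise" using context and part of speech.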
A bag-of-words (BOW) is created after the cleansed text is normalized. The text is then represented as the collection of its words, ignoring their order. An n-grams model can be used to capture the sequence of words.
The bag-of-words model can be viewed as a special case of the n-gram model, with n = 1.
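The relationship between the two models can be sketched as follows: a BOW keeps only word counts, while n-grams keep runs of n consecutive words, so unigrams (n = 1) recover the BOW tokens. The sample tokens are illustrative.

```python
from collections import Counter

def bag_of_words(tokens):
    # Word counts; word order is discarded.
    return Counter(tokens)

def ngrams(tokens, n):
    # All runs of n consecutive tokens; word order is preserved.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["profit", "rose", "profit", "fell"]
bag_of_words(tokens)  # counts: profit 2, rose 1, fell 1
ngrams(tokens, 2)     # → [('profit', 'rose'), ('rose', 'profit'), ('profit', 'fell')]
```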
A term document matrix is a way of representing the words in the text as a table (or matrix) of numbers. The rows of the matrix represent the text responses to be analyzed, and the columns of the matrix represent the words from the text that are to be used in the analysis. At this point the unstructured data are converted to structured data.
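The construction described above can be sketched in a few lines: each row is a document, each column is a vocabulary term, and each cell counts how often that term appears in that document. The sample documents are illustrative.

```python
def term_document_matrix(docs):
    # Rows = documents, columns = vocabulary terms (sorted for a stable order).
    vocab = sorted({w for doc in docs for w in doc})
    matrix = [[doc.count(term) for term in vocab] for doc in docs]
    return vocab, matrix

docs = [["profit", "rose"], ["profit", "fell", "profit"]]
vocab, matrix = term_document_matrix(docs)
# vocab  → ['fell', 'profit', 'rose']
# matrix → [[0, 1, 1], [1, 2, 0]]
```

The resulting numeric table is structured data that standard ML models can consume directly.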