Subject 2. Data Preparation and Wrangling
Data preparation and wrangling involve cleansing and organizing data into a consolidated format.
Structured Data
For structured data, data preparation and wrangling entail data cleansing and data preprocessing.
Data cleansing is the process of detecting and correcting (or removing) corrupt or inaccurate data. This process is crucial and emphasized because wrong data can drive a business to wrong decisions, conclusions, and poor analysis.
Typical errors are:
- Incompleteness: required data are missing.
- Invalidity: data fall outside a required range, e.g., numbers or dates that must lie within certain bounds.
- Inaccuracy: the data are not close to the true values.
- Inconsistency: e.g., a valid age of 10 combined with a 'divorced' marital status.
- Non-uniformity: e.g., the currency is sometimes in USD and sometimes in yen.
- Duplication: duplicate entries should be removed.
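The checks above can be sketched in code. This is a minimal illustration, not a production cleansing routine; the field names, valid ranges, and example records are assumptions made for the example.

```python
# Flag typical data errors in a list of records (illustrative fields/ranges).
records = [
    {"age": 34, "marital_status": "married", "salary_usd": 72000},
    {"age": None, "marital_status": "single", "salary_usd": 55000},   # incomplete
    {"age": 10, "marital_status": "divorced", "salary_usd": 0},       # inconsistent
    {"age": 34, "marital_status": "married", "salary_usd": 72000},    # duplicate
]

def find_errors(rows):
    errors = []
    seen = set()
    for i, row in enumerate(rows):
        if any(v is None for v in row.values()):
            errors.append((i, "incompleteness"))
        if row["age"] is not None and not (0 <= row["age"] <= 120):
            errors.append((i, "invalidity"))
        if row["age"] is not None and row["age"] < 16 and row["marital_status"] == "divorced":
            errors.append((i, "inconsistency"))
        key = tuple(sorted(row.items()))
        if key in seen:
            errors.append((i, "duplication"))
        seen.add(key)
    return errors

find_errors(records)
# → [(1, 'incompleteness'), (2, 'inconsistency'), (3, 'duplication')]
```

Each flagged record would then be corrected, imputed, or removed before preprocessing.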
Data preprocessing typically involves transformations and scaling of the (cleansed) data.
- Extraction: build derived values intended to be informative and non-redundant.
- Aggregation: combine two or more attributes into a single attribute, e.g., cities aggregated into regions, states, or countries.
- Filtration: identify and remove noise in the dataset.
- Selection: select a subset of relevant attributes for use in the model.
- Conversion: convert attributes into different data types.
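A few of these transformations can be shown on a single cleansed record. The city-to-region mapping and field names are illustrative assumptions, not a standard scheme.

```python
# Preprocessing sketch: extraction, aggregation, and conversion.
REGION = {"Boston": "Northeast", "Austin": "South", "Seattle": "West"}  # assumed mapping

record = {"city": "Austin", "age": "41", "joined": "2023-06-01"}

processed = {
    "region": REGION[record["city"]],        # aggregation: city -> region
    "age": int(record["age"]),               # conversion: string -> integer
    "join_year": int(record["joined"][:4]),  # extraction: derive year from date
}
# processed → {'region': 'South', 'age': 41, 'join_year': 2023}
```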
Scaling: adjusting the range of a feature by shifting and changing the scale of data. Outliers should be removed before performing scaling. Two common ways of scaling are normalization and standardization.
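The two common scaling methods can be sketched directly from their definitions: normalization rescales values to the [0, 1] interval, and standardization centers them to zero mean and unit standard deviation.

```python
def normalize(xs):
    # Min-max normalization: rescale each value to [0, 1].
    lo, hi = min(xs), max(xs)
    return [(x - lo) / (hi - lo) for x in xs]

def standardize(xs):
    # Standardization: subtract the mean, divide by the standard deviation.
    n = len(xs)
    mean = sum(xs) / n
    sd = (sum((x - mean) ** 2 for x in xs) / n) ** 0.5
    return [(x - mean) / sd for x in xs]

values = [10.0, 20.0, 30.0, 40.0, 50.0]
normalize(values)  # → [0.0, 0.25, 0.5, 0.75, 1.0]
```

Because both methods use the minimum, maximum, mean, or standard deviation, a single extreme outlier can distort the scaled values, which is why outliers are removed first.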
Unstructured Data
Unstructured data must be transformed into structured data to be processed by computers. Text data is used in the reading to illustrate the cleansing and preprocessing of unstructured data.
Text cleansing removes unwanted elements from the raw text. It typically involves removing HTML tags, punctuation, most numbers, and extra white space.
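These cleansing steps can be sketched with regular expressions; the order of the substitutions here is one reasonable choice, not a prescribed sequence.

```python
import re

def cleanse(text):
    text = re.sub(r"<[^>]+>", " ", text)     # remove HTML tags
    text = re.sub(r"[0-9]+", " ", text)      # remove most numbers
    text = re.sub(r"[^\w\s]", " ", text)     # remove punctuation
    text = re.sub(r"\s+", " ", text).strip() # collapse extra white space
    return text

cleanse("<p>Revenue rose 12% in Q4!</p>")
# → 'Revenue rose in Q'
```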
Text wrangling (preprocessing) can be essential in making sure you have the best data to work with. It requires performing normalization and involves the following:
- Removing stop words such as "the" and "a" because they occur so frequently.
- Stemming: cutting a token down to its root stem.
- Lemmatization: takes context and part of speech into account to determine the lemma, the root form of the word.
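Stop-word removal and stemming can be illustrated with a toy sketch. The stop-word set is a small assumed sample, and the suffix-stripping "stemmer" is a crude stand-in for a real algorithm such as the Porter stemmer found in NLP libraries.

```python
# Illustrative normalization: stop-word removal plus toy suffix stripping.
STOP_WORDS = {"the", "a", "an", "of", "and", "is"}  # assumed sample list

def stem(token):
    # Crude stem: strip common suffixes (a stand-in for a real stemmer).
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token

def normalize_tokens(tokens):
    return [stem(t) for t in tokens if t not in STOP_WORDS]

normalize_tokens(["the", "analysts", "reported", "rising", "profits"])
# → ['analyst', 'report', 'ris', 'profit']
```

Note that a stem such as 'ris' need not be a dictionary word; lemmatization would instead map "rising" to the lemma "rise" using context and part of speech.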
A bag-of-words (BOW) is created after the cleansed text is normalized. The text is then represented as the collection of its words, ignoring their order. An n-grams model can be used to capture the sequence of words.
The bag-of-words model can be viewed as a special case of the n-gram model, with n = 1.
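The relationship between the two models can be sketched as follows: a BOW keeps only word counts, while n-grams keep runs of n consecutive words, so unigrams (n = 1) recover the BOW tokens. The sample tokens are illustrative.

```python
from collections import Counter

def bag_of_words(tokens):
    # Word counts; word order is discarded.
    return Counter(tokens)

def ngrams(tokens, n):
    # All runs of n consecutive tokens; word order is preserved.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["profit", "rose", "profit", "fell"]
bag_of_words(tokens)  # counts: profit 2, rose 1, fell 1
ngrams(tokens, 2)     # → [('profit', 'rose'), ('rose', 'profit'), ('profit', 'fell')]
```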
A term document matrix is a way of representing the words in the text as a table (or matrix) of numbers. The rows of the matrix represent the text responses to be analyzed, and the columns of the matrix represent the words from the text that are to be used in the analysis. At this point the unstructured data are converted to structured data.
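The construction described above can be sketched in a few lines: each row is a document, each column is a vocabulary term, and each cell counts how often that term appears in that document. The sample documents are illustrative.

```python
def term_document_matrix(docs):
    # Rows = documents, columns = vocabulary terms (sorted for a stable order).
    vocab = sorted({w for doc in docs for w in doc})
    matrix = [[doc.count(term) for term in vocab] for doc in docs]
    return vocab, matrix

docs = [["profit", "rose"], ["profit", "fell", "profit"]]
vocab, matrix = term_document_matrix(docs)
# vocab  → ['fell', 'profit', 'rose']
# matrix → [[0, 1, 1], [1, 2, 0]]
```

The resulting numeric table is structured data that standard ML models can consume directly.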