Assimilating data

Ask > Acquire > Assimilate > Analyze > Answer > Advise > Act


Extract: data is usually stored in a source of a specific type, such as flat files, relational databases, streaming data, XML/JSON files, or Open Database Connectivity (ODBC) and/or Java Database Connectivity (JDBC) data sources. Transform: cleanse, convert, aggregate, merge, split, and modify the data so that it is useful. Load: once transformed, the data can be loaded into another database or warehouse.
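The three steps can be sketched with the Python standard library alone. This is a minimal illustration, not a production pipeline: the CSV contents, the column names, and the `weather` table are all hypothetical, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input; in practice this would come from a file or database.
RAW_CSV = """city,temp_f
Toronto,68
Waterloo,
Ottawa,77
"""

def extract(text):
    """Extract: parse rows out of a flat-file (CSV) source."""
    return list(csv.DictReader(io.StringIO(text)))

def transform(rows):
    """Transform: drop rows with a missing value, convert Fahrenheit to Celsius."""
    cleaned = []
    for row in rows:
        if not row["temp_f"]:
            continue  # cleanse: skip incomplete records
        temp_c = (float(row["temp_f"]) - 32) * 5 / 9
        cleaned.append((row["city"], round(temp_c, 1)))
    return cleaned

def load(records, conn):
    """Load: write the transformed records into a database table."""
    conn.execute("CREATE TABLE IF NOT EXISTS weather (city TEXT, temp_c REAL)")
    conn.executemany("INSERT INTO weather VALUES (?, ?)", records)
    conn.commit()

conn = sqlite3.connect(":memory:")
load(transform(extract(RAW_CSV)), conn)
loaded = conn.execute("SELECT city, temp_c FROM weather ORDER BY city").fetchall()
```

Keeping each stage as its own function makes it easy to swap the source or the destination without touching the transformation logic.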

Transforming and Normalizing Data

Binning Numerical Data

If numerical data needs to be transformed into categorical data, it can be binned into groups.
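As a small sketch of binning, the function below maps a number to a labelled group using sorted bin edges. The age boundaries and labels are made up for illustration; real bins would come from domain knowledge or the data's distribution.

```python
from bisect import bisect_right

def bin_value(value, edges, labels):
    """Return the label of the bin that `value` falls into.

    `edges` are the sorted interior boundaries, so there must be
    len(edges) + 1 labels. A value equal to an edge goes to the upper bin.
    """
    return labels[bisect_right(edges, value)]

# Hypothetical bins: [..., 18) minor, [18, 65) adult, [65, ...) senior
AGE_EDGES = [18, 65]
AGE_LABELS = ["minor", "adult", "senior"]

ages = [5, 18, 42, 65, 80]
binned = [bin_value(a, AGE_EDGES, AGE_LABELS) for a in ages]
```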

Encoding Categorical Data

If categorical data needs to be transformed into numerical data (for example, if a classifier only works with numerical data), it can be encoded. If the data is binary, it can be encoded as 1 or 0. If the data is not binary, the issue is that adding further integer codes (2, 3, and so on) implies an ordering between categories that may not exist.
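One common way to encode a non-binary category without assigning it a single arbitrary integer is one-hot encoding, where each category gets its own 0/1 column. The sketch below is a plain-Python illustration; the travel-mode values are hypothetical.

```python
def one_hot_encode(values):
    """Encode each categorical value as a 0/1 vector with one slot per category."""
    categories = sorted(set(values))            # fixed, reproducible column order
    index = {category: i for i, category in enumerate(categories)}
    encoded = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1                   # mark the column for this category
        encoded.append(row)
    return categories, encoded

categories, encoded = one_hot_encode(["bike", "taxi", "walk", "bike"])
```

The trade-off is width: a feature with many distinct categories produces many columns.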


Fitting data between a specific range (such as -1 to 1)
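A minimal sketch of this kind of rescaling is min-max scaling, shown here with a target range of -1 to 1. The function name and the choice to map a constant column to the midpoint are assumptions made for the example.

```python
def rescale(values, new_min=-1.0, new_max=1.0):
    """Linearly map values so the smallest becomes new_min and the largest new_max."""
    old_min, old_max = min(values), max(values)
    if old_max == old_min:
        # Constant input has no spread; map everything to the middle of the range.
        return [(new_min + new_max) / 2 for _ in values]
    scale = (new_max - new_min) / (old_max - old_min)
    return [new_min + (v - old_min) * scale for v in values]

scaled = rescale([10, 15, 20])
```

Because the minimum and maximum define the mapping, a single extreme value can compress everything else into a narrow band, so it is usually worth handling outliers first.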

Format conversion

Binary conversion, time conversion.
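Two small illustrations of format conversion are sketched below: turning a timestamp string into seconds since the Unix epoch, and rendering an integer as a fixed-width binary string. The timestamp format and the assumption that the input is in UTC are choices made for this example only.

```python
from datetime import datetime, timezone

def to_unix_seconds(text, fmt="%Y-%m-%d %H:%M:%S"):
    """Parse a timestamp string (assumed to be UTC) into seconds since the epoch."""
    parsed = datetime.strptime(text, fmt).replace(tzinfo=timezone.utc)
    return int(parsed.timestamp())

def to_binary_string(number, width=8):
    """Render a non-negative integer as a zero-padded binary string."""
    return format(number, "0{}b".format(width))

one_day = to_unix_seconds("1970-01-02 00:00:00")
five_in_binary = to_binary_string(5)
```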

Coordinate Conversion

Geo conversion.

For example, GPS coordinates versus distance.
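One way to turn a pair of GPS coordinates into a distance is the haversine formula, which gives the great-circle distance on a sphere. The sketch below assumes a spherical Earth with a mean radius of 6371 km, so results are approximate.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    # Clamp to 1.0 to guard against tiny floating-point overshoot before asin.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))

quarter_circumference = haversine_km(0.0, 0.0, 0.0, 90.0)
```

A derived feature such as "distance to the nearest station" is often more useful to a model than raw latitude and longitude.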

Fourier Transform

Convert to the frequency domain using the Fast Fourier Transform or the discrete wavelet transform.
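To show what the frequency-domain conversion produces, here is a direct (quadratic-time) discrete Fourier transform in plain Python. It is not a Fast Fourier Transform: an FFT computes the same coefficients far more efficiently, and a library routine would normally be used instead. The test signal is a synthetic sine wave chosen for the example.

```python
import cmath
import math

def dft(signal):
    """Naive discrete Fourier transform: one complex coefficient per frequency bin."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]

# Synthetic signal: a sine wave completing 3 cycles over 32 samples.
N = 32
signal = [math.sin(2 * math.pi * 3 * t / N) for t in range(N)]

magnitudes = [abs(c) for c in dft(signal)]
# Only the first half of the bins is distinct for a real-valued signal.
dominant_bin = max(range(N // 2), key=lambda k: magnitudes[k])
```

The dominant bin recovers the number of cycles in the window, which is the kind of feature that is hard to see in the raw time series.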

Data Cleaning

Outlier Detection

As part of the data cleaning stage, outliers can be identified through outlier detection. Once identified, they can either be removed (and ignored) or imputed. Alternatively, there are algorithms which robustly ignore (or deal with) outliers.
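As one possible sketch of detection followed by imputation, the code below flags points using a modified z-score based on the median absolute deviation (MAD) and replaces them with the median of the remaining values. The 3.5 threshold and the 0.6745 scaling constant are conventional choices rather than anything prescribed here, and the readings are invented.

```python
import statistics

def find_outliers(values, threshold=3.5):
    """Return indices whose modified z-score (median/MAD based) exceeds threshold."""
    med = statistics.median(values)
    mad = statistics.median(abs(v - med) for v in values)
    if mad == 0:
        return []  # no spread around the median; nothing can be flagged this way
    return [i for i, v in enumerate(values)
            if 0.6745 * abs(v - med) / mad > threshold]

def impute_with_median(values, outlier_indices):
    """Replace flagged values with the median of the non-flagged values."""
    flagged = set(outlier_indices)
    fill = statistics.median(v for i, v in enumerate(values) if i not in flagged)
    return [fill if i in flagged else v for i, v in enumerate(values)]

readings = [10, 12, 11, 13, 12, 95, 11]   # hypothetical sensor readings
outliers = find_outliers(readings)
cleaned = impute_with_median(readings, outliers)
```

The median and MAD are used instead of the mean and standard deviation because they are themselves much less affected by the outliers being searched for.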


Random sampling, distributions, mean, regression



Removing duplicates
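A minimal sketch of exact-duplicate removal that keeps the first occurrence of each record and preserves order. It assumes records are hashable (tuples here); near-duplicates, such as differing whitespace or letter case, would need normalizing first. The sample records are hypothetical.

```python
def drop_duplicates(records):
    """Return records with exact duplicates removed, keeping first occurrences."""
    seen = set()
    unique = []
    for record in records:
        if record not in seen:
            seen.add(record)
            unique.append(record)
    return unique

people = [("alice", 30), ("bob", 25), ("alice", 30), ("bob", 26)]
deduplicated = drop_duplicates(people)
```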

Dimensionality Reduction

Principal Component Analysis
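The idea can be sketched without any libraries: center the data, build the covariance matrix, and find its leading eigenvector by power iteration. This is only an illustration that recovers the first component. It assumes the data has non-zero variance and that the starting vector is not orthogonal to the leading direction; a linear algebra library would normally be used in practice.

```python
import math

def first_principal_component(rows, iterations=100):
    """Return (direction, projections) for the first principal component."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeated multiplication converges to the top eigenvector.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Project each centered row onto the direction: 1-D reduced representation.
    projections = [sum(c[j] * v[j] for j in range(d)) for c in centered]
    return v, projections

# Hypothetical 2-D points lying exactly on the line y = 2x.
points = [(1, 2), (2, 4), (3, 6), (4, 8)]
direction, reduced = first_principal_component(points)
```

For these points the recovered direction is proportional to (1, 2), so the single projected coordinate keeps all of the variance of the two original columns.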

Membership Assignment

First use clustering (X-means, canopy clustering) to determine the clusters, then assign a membership.
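The assignment step can be sketched as a nearest-centroid lookup. The centroids below are hypothetical stand-ins for the output of a clustering run; the clustering algorithm itself is not implemented here. The resulting cluster index can then be used as a new categorical feature.

```python
import math

def assign_membership(point, centroids):
    """Return the index of the centroid closest to `point` (Euclidean distance)."""
    return min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))

# Hypothetical centroids, as if produced by an earlier clustering step.
centroids = [(0.0, 0.0), (10.0, 10.0)]
observations = [(1.0, 2.0), (9.0, 8.0), (4.0, 4.0)]
memberships = [assign_membership(p, centroids) for p in observations]
```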

Feature hashing

Text data: Term Frequency-Inverse Document Frequency (TF-IDF)
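A minimal sketch of TF-IDF weighting follows, using the plain unsmoothed form: term frequency within a document multiplied by the log of (number of documents / number of documents containing the term). Libraries often use smoothed or normalized variants, so the exact numbers differ between implementations. The whitespace tokenizer and the sample sentences are assumptions for the example.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {term: weight} dict per document using unsmoothed TF-IDF."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the taxi is late", "the bike is fast", "the walk is long"]
weights = tf_idf(docs)
```

Words that appear in every document, such as "the" here, get a weight of zero, while words that are distinctive to one document get the highest weights.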

Data enrichment

In addition to cleaning data, data can be enriched with additional sources of data. For example, features can be added by joining an additional dataset on a specific feature: geography or weather can be added to a dataset to predict whether someone is going to take a taxi, bike, or walk.
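Enrichment of this kind amounts to a join on a shared feature. The sketch below attaches weather attributes to trip records by date. All records, field names, and values are invented for illustration.

```python
def enrich(records, lookup, key):
    """Add the fields found in lookup[record[key]] to each record.

    Records with no match in the lookup are kept unchanged.
    """
    return [{**record, **lookup.get(record[key], {})} for record in records]

# Hypothetical primary dataset: one row per trip.
trips = [
    {"date": "2015-02-17", "mode": "taxi"},
    {"date": "2015-02-18", "mode": "walk"},
    {"date": "2015-02-19", "mode": "bike"},
]

# Hypothetical additional dataset keyed by the shared feature (date).
weather_by_date = {
    "2015-02-17": {"temp_c": -12, "snow": True},
    "2015-02-18": {"temp_c": 3, "snow": False},
}

enriched = enrich(trips, weather_by_date, "date")
```

Unmatched records are left as they were, so missing weather values still need to be handled before modelling.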

Further Reading

Data Preparation

Dealing with outliers

Dimensionality Reduction

A Tutorial on Principal Component Analysis

Principal Components Analysis

PCA step by step

Assimilating Data - February 19, 2015 - Andrew Andrade