Big Data Fundamentals
Essential Concepts and Tools
Originally created by Darrell Aucoin for a Big Data talk at uWaterloo's Stats Club.
"In pioneer days they used oxen for heavy pulling, and when one ox couldn't budge a log, they didn't try to grow a larger ox. We shouldn't be trying for bigger computers, but for more systems of computers." (Grace Hopper)
What is Big Data?
The 4 V’s of Big Data
Volume: The quantity of data.
- Usually too big to fit into the memory of a single machine.
Veracity: The quality of the data, which can vary greatly and is often inconsistent across sources.
Variety: Data comes in many formats and from many sources, which often must be combined for analysis.
Velocity: The speed at which new data is generated.
- 100 terabytes of data are uploaded to Facebook daily.
- 90% of all data ever created was generated in the past 2 years.
Problem: It is impossible to analyze data of this size with a single computer.
Solution: A distributed computing system.
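To make the idea concrete, here is a minimal sketch of the split-then-combine pattern a distributed system applies, using Python's multiprocessing pool as a stand-in for a cluster of machines; the word-count task and the example chunks are illustrative assumptions, not part of the original talk. This map-then-merge shape is what frameworks like Hadoop MapReduce automate across many machines.

```python
# A minimal sketch of the distributed-computing idea: split the data,
# process the pieces in parallel, then combine the partial results.
# Python's multiprocessing Pool stands in for a cluster of machines.
from collections import Counter
from multiprocessing import Pool

def count_words(chunk):
    """Map step: count words in one chunk of the data."""
    return Counter(chunk.split())

def merge(partials):
    """Reduce step: combine the partial counts into one result."""
    total = Counter()
    for counts in partials:
        total.update(counts)
    return total

if __name__ == "__main__":
    # Each string stands in for a block of data too big for one machine.
    chunks = ["big data big systems", "more systems of computers",
              "big computers or more computers"]
    with Pool() as pool:
        partials = pool.map(count_words, chunks)  # done in parallel
    print(merge(partials))  # e.g. Counter({'big': 3, 'computers': 3, ...})
```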
Veracity
- Big Data sets do not have the controls of designed studies: naming inconsistencies, inconsistencies in signal strength, and so on.
- We cannot simply assume the data are missing at random.
- Example of a naming inconsistency: the same musician named several different ways across several files (see the sketch after this list).
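As one way to picture the naming problem, the sketch below compares variant spellings of an artist's name using Python's standard-library difflib; the variants and the 0.6 similarity cutoff are made-up examples, not a prescribed cleaning method.

```python
# Sketch: reconcile inconsistent names for the same musician across files.
# The variants and the 0.6 similarity cutoff are illustrative choices.
from difflib import SequenceMatcher

names = ["Beyonce", "Beyoncé", "Beyonce Knowles", "beyonce_knowles"]

def similarity(a, b):
    """Ratio in [0, 1]; 1.0 means the normalized strings are identical."""
    norm = lambda s: s.lower().replace("_", " ").replace("é", "e")
    return SequenceMatcher(None, norm(a), norm(b)).ratio()

canonical = names[0]
for name in names[1:]:
    score = similarity(canonical, name)
    # Treat anything above the cutoff as the same entity.
    print(f"{name!r} -> {canonical!r}? score={score:.2f}, "
          f"match={score > 0.6}")
```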
Variety
- Most Big Data is unstructured or semi-structured, without the guarantees of SQL (fixed schemas, types, constraints).
- Data can be structured, semi-structured, or unstructured.
- Analyses often have to combine datasets from a variety of sources and formats, as sketched below.
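To illustrate what combining sources of different formats can look like, here is a small sketch joining a structured CSV extract with semi-structured JSON metadata; the fields, values, and join key are invented for the example.

```python
# Sketch: combine a structured CSV source with a semi-structured JSON
# source into one table for analysis. All data here is made up.
import csv, io, json

csv_plays = "artist_id,plays\n1,120\n2,87\n"        # structured source
json_artists = ('[{"id": 1, "name": "Artist A"},'   # semi-structured
                ' {"id": 2, "name": "Artist B"}]')

artists = {a["id"]: a["name"] for a in json.loads(json_artists)}

for row in csv.DictReader(io.StringIO(csv_plays)):
    # Join the two sources on the artist id.
    print(artists[int(row["artist_id"])], row["plays"])
```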
Velocity
- The speed at which data is created, stored, and analyzed to produce actionable intelligence; a streaming sketch follows this list.
- 100 hours of video are uploaded to YouTube every minute.
- 200 million emails are sent every minute.
- 20 million photos are viewed every minute.
- Teams often need to be very agile in creating a data product.
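One way to picture velocity is processing records as they arrive rather than in batches after the fact; the sketch below keeps a running count over a simulated event stream, with the generator standing in hypothetically for a live feed.

```python
# Sketch: process events one at a time as they arrive, instead of
# waiting for a complete batch. The stream below simulates a live feed.
import random
from collections import Counter

def event_stream(n):
    """Hypothetical stand-in for a live feed of user events."""
    for _ in range(n):
        yield random.choice(["upload", "email", "photo_view"])

counts = Counter()
for i, event in enumerate(event_stream(10_000), start=1):
    counts[event] += 1          # update state one record at a time
    if i % 5_000 == 0:          # emit a running summary periodically
        print(i, dict(counts))
```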