Big data fundamentals

Essential Concepts and Tools

Originally created by Darrell Aucoin for a Big data talk at uWaterloo’s Stats Club

In pioneer days they used oxen for heavy pulling, and when one ox couldn’t budge a log, they didn’t try to grow a larger ox. We shouldn’t be trying for bigger computers, but for more systems of computers.
- Grace Hopper

What is Big Data?

The 4 V’s of Big Data

Volume: The quantity of data.

Veracity: The quality and consistency of the data, both of which can vary.

Variety: Data often comes in a variety of formats and from a variety of sources, which often need to be combined for analysis.

Velocity: The speed of generation of new data.


Problem: Data of this size cannot practically be analyzed on a single computer.

Solution: A distributed computing system.
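
The idea behind a distributed system can be sketched on a single machine: split the data into chunks, let each worker process its chunk independently, then combine the partial results. The following is only a minimal illustration under stated assumptions, not a real cluster: threads stand in for separate computers, and the function names (split_into_chunks, count_words, distributed_word_count) are hypothetical names chosen for this example.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(records, n_chunks):
    """Split a list into roughly equal chunks, one per simulated machine."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def count_words(chunk):
    """Work each simulated machine does on its own: count words in its chunk."""
    counts = Counter()
    for line in chunk:
        counts.update(line.lower().split())
    return counts


def distributed_word_count(records, n_workers=3):
    """Count words by splitting the work, then merging the partial counts."""
    chunks = split_into_chunks(records, n_workers)
    # Threads stand in for separate computers in this sketch (an assumption).
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_counts = list(pool.map(count_words, chunks))
    # Combine step: merge every partial result into one total.
    total = Counter()
    for partial in partial_counts:
        total.update(partial)
    return total
```

No worker ever needs to see the whole dataset, which is why the same pattern can scale out to many computers instead of one larger computer.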

Veracity example (naming inconsistency): the same musician is named several different ways across several different files.
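
One common way to handle this kind of inconsistency is to map every spelling to a single normalized key before combining files. The sketch below is only an illustration: the normalization rules (lowercasing, stripping accents and punctuation, flipping "Last, First") and the function names are assumptions made for this example, and real data usually needs more careful matching.

```python
import re
import unicodedata
from collections import defaultdict


def normalize_name(name):
    """Reduce a name to a simple comparison key (illustrative rules only)."""
    # Strip accents: decompose characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", name)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = text.lower().strip()
    # Turn "last, first" into "first last".
    if "," in text:
        last, first = text.split(",", 1)
        text = f"{first.strip()} {last.strip()}"
    # Drop punctuation and collapse repeated whitespace.
    text = re.sub(r"[^\w\s]", "", text)
    return re.sub(r"\s+", " ", text).strip()


def group_variants(names):
    """Group raw spellings under their shared normalized key."""
    groups = defaultdict(list)
    for name in names:
        groups[normalize_name(name)].append(name)
    return dict(groups)
```

For example, the illustrative spellings "David Bowie", "Bowie, David", and "david  bowie" (values made up for this example, not taken from the talk) all end up under the single key "david bowie".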



Creating a data product often requires being very agile.

