To Big or Not To Big Data

Hype

Big Data today is not as sexy as it used to be. Now that the hype is gone, what is Big Data actually, and what is it not? Let’s start with a bit of history.

Short Big Data History

2003 Distributed Storage: Google File System
2004 Distributed Computation: MapReduce: Simplified Data Processing on Large Clusters
2006 Distributed Storage & Computation Implementation: Hadoop
2006 Distributed Random Access: Bigtable: A Distributed Storage System for Structured Data
2007 Distributed Random Access Implementation: HBase
2008 - 2012 1st gen. Horizontal Development (Hadoop Maturity, Ecosystem Enrichment)
2012 Distributed Data Structures: Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing
2014 Spark becomes Apache Spark
2014 - 2017 2nd gen. Horizontal Development (Spark Maturity, phasing out MapReduce)

What did Big Data promise in the first place?

Big Data appeared in the dark years of the data swamps generated by the exponential growth of the web, when nearly everyone started to connect to the global network called the Internet. It was not only people who got connected: everything from monsters like huge global corporations to tiny glow worms like Raspberry Pis started to shout from the most obscure corners of the Internet.

In this madness, Big Data promised enlightenment: a set of tools and techniques to help you get insights from large volumes of data. And this is what we got, in nearly 15 years! Now it’s so easy to query terabytes of data in plain SQL that you might not even know that underneath it’s all Big Data.

In a way Big Data lost its hype by making itself too accessible.

What is Big Data not?

Big Data is not the ultimate Machine Learning tool. Big Data techniques and tools can be used to implement Machine Learning tasks, but when you run off to Deep Learning, don’t bully Big Data.

Don’t forget that the same Jeff Dean who contributed to the initial Big Data revolution (see the authors of the MapReduce and Bigtable papers) also made Deep Learning scalable through TensorFlow, with ideas that are not too different from what we find in this granny called Big Data (execution transparency, lazy definitions of data transformations, in-memory computations).
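To make "lazy definitions of data transformations" a bit more concrete, here is a minimal sketch in plain Python. The LazyDataset class and its map, filter and collect methods are hypothetical names made up for this illustration; they only imitate the style of API found in tools like Spark and are not taken from any real library. The point is that describing the pipeline does no work; the data is only touched when a result is requested.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy, single-machine stand-in for a lazily evaluated dataset.

    Hypothetical class for illustration only, not a real library API.
    """

    def __init__(
        self,
        source: Iterable,
        steps: Optional[List[Callable[[Iterator], Iterator]]] = None,
    ) -> None:
        self._source = source
        self._steps = steps or []  # recorded transformations, not yet executed

    def map(self, fn: Callable) -> "LazyDataset":
        # Only record the step; nothing is computed here.
        return LazyDataset(self._source, self._steps + [lambda it: (fn(x) for x in it)])

    def filter(self, pred: Callable) -> "LazyDataset":
        # Only record the step; nothing is computed here.
        return LazyDataset(self._source, self._steps + [lambda it: (x for x in it if pred(x))])

    def collect(self) -> list:
        # The "action": chain all recorded steps and finally pull the data through.
        it = iter(self._source)
        for step in self._steps:
            it = step(it)
        return list(it)


# Defining the pipeline does no work yet.
pipeline = LazyDataset(range(5)).map(lambda x: x * 2).filter(lambda x: x > 2)

# Only now are the transformations actually executed.
print(pipeline.collect())  # [4, 6, 8]
```

Because the whole pipeline is known before anything runs, a real engine is free to decide where and how to execute it, which is where the execution transparency comes from. This toy version simply streams the elements through generators on one machine.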

On a more general note, the Big Data world will help you answer your questions, but it will not be able to extend your perception the way Neural Networks are doing now.

The future?

What would the 3rd gen. of the Big Data world look like? Will there be one?

Check out Ion Stoica’s talk about the future of Big Data at Berkeley: RISELab: Enabling Intelligent Real-Time Decisions