In a 2001 research report for META Group, Doug Laney planted the seeds of Big Data by framing data growth challenges and opportunities in a “3Vs” model. The elements of this 3Vs model are volume (the sheer, massive amount of data, or the “Big” in Big Data), velocity (the speed at which data is produced and processed) and variety (the breadth of data types and sources). Roger Magoulas of O’Reilly Media popularized the term “Big Data” in 2005 to describe these challenges and opportunities. Gartner now defines Big Data as “high-volume, high-velocity and high-variety information assets that demand cost-effective, innovative forms of information processing for enhanced insight and decision making.” More recently, IBM added a fourth “V,” veracity, as an “indication of data integrity and the ability for an organization to trust the data and be able to confidently use it to make crucial decisions.”
The volume of data being created in our world today is growing exponentially. McKinsey’s 2011 report “Big data: The next frontier for innovation, competition, and productivity” noted that:
• $600 buys a disk drive that can store all of the world’s music
• 5 billion mobile phones were in use in 2010
• more than 30 billion pieces of content are shared on Facebook every month
• global data generation is projected to grow 40% per year, versus 5% growth in global IT spending
• the US Library of Congress had collected 235 terabytes of data by April 2011
• in 15 of 17 sectors in the United States, companies have more data stored per company than the US Library of Congress
IBM has estimated that “Every day, we create 2.5 quintillion bytes (2.5 exabytes) of data — so much that 90% of the data in the world today has been created in the last two years alone.” In their book “Big Data: A Revolution That Will Transform How We Live, Work, and Think,” Viktor Mayer-Schönberger and Kenneth Cukier state that “In 2013 the amount of stored information in the world is estimated to be around 1,200 Exabytes, of which less than 2 percent is non-digital.” To illustrate the scale, they note that if all of this information were placed on CD-ROMs and stacked up, the discs would stretch to the moon in five separate piles.
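The CD-ROM claim is easy to sanity-check with back-of-envelope arithmetic. The sketch below is my own calculation, not from the book, and assumes a typical CD-ROM capacity of roughly 700 MB and a disc thickness of roughly 1.2 mm:

```python
# Back-of-envelope check: does 1,200 exabytes on stacked CD-ROMs
# really reach the moon in about five piles?
CD_CAPACITY_BYTES = 700 * 10**6   # ~700 MB per CD-ROM (assumed)
CD_THICKNESS_M = 1.2e-3           # ~1.2 mm per disc (assumed)
MOON_DISTANCE_M = 384_400_000     # average Earth-moon distance in meters

total_bytes = 1_200 * 10**18      # 1,200 exabytes
discs = total_bytes / CD_CAPACITY_BYTES
stack_height_m = discs * CD_THICKNESS_M
piles = stack_height_m / MOON_DISTANCE_M

print(f"{piles:.1f} piles to the moon")  # roughly 5
```

The result lands close to five, so the book’s illustration holds up under these assumptions.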
This sheer volume of data presents huge challenges. For time-sensitive processes such as fraud detection, a quick response is critical. How does one find the signal in all that noise? The variety of both structured and unstructured data is ever expanding in form: numeric data, text documents, audio, video, and more. Finally, in a world where 1 in 3 business leaders lack trust in the information they use to make decisions, data veracity is a barrier to taking action.
The solution lies in ever more inexpensive and accessible processing power and the nascent science of machine learning. Abraham Kaplan’s (1964) principle of the drunkard’s search still holds true: “There is the story of a drunkard, searching under a lamp for his house key, which he dropped some distance away. Asked why he didn’t look where he dropped it, he replied ‘It’s lighter here!’” A massive dataset that carries the same bias as a small dataset will only give you a more precise validation of a flawed answer, and we are still in the early days. Big Data is the opportunity to unlock answers to previously unanswerable questions and to uncover insights previously unseen. With it come new dangers, as the NSA warrantless surveillance controversy clearly exposed.
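The point about bias can be made concrete with a short simulation. This is an illustrative sketch of my own, not from the text: both a small and a massive sample are drawn from a collection process with a systematic offset, and the big sample simply narrows the error bars around the wrong answer.

```python
# Sketch: "big" biased data vs. "small" biased data.
# The true quantity is 0.0, but the collection process is shifted by 1.0.
import random
import statistics

random.seed(42)

TRUE_MEAN = 0.0   # the quantity we actually want to estimate
BIAS = 1.0        # systematic offset baked into how the data is gathered

def biased_sample(n):
    """Draw n observations from the biased collection process."""
    return [random.gauss(TRUE_MEAN + BIAS, 1.0) for _ in range(n)]

small_estimate = statistics.mean(biased_sample(100))
large_estimate = statistics.mean(biased_sample(1_000_000))

# Both estimates hover around 1.0, not the true 0.0; the massive sample
# is merely a more precise validation of the flawed answer.
print(f"small-sample estimate: {small_estimate:.2f}")
print(f"large-sample estimate: {large_estimate:.2f}")
```

More data tightens the variance of the estimate, but it does nothing to the bias; only looking where the key actually fell fixes that.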
I have had the privilege of listening to Clayton Christensen speak several times. One common through line, in particular, stuck with me and forever embedded itself in my consciousness: “I don’t have an opinion. But I have a theory, and I think my theory has an opinion.” I believe the same is true of Big Data. The data has an opinion; the data has the answers.