Big Data: How Much Data Is Big Data?

Here’s how much data big data actually is:

In simplest terms, big data is any amount of data that is too big for your current systems to handle.

If you have to invest in more or better computers, it’s big data. 

If you have to change how you deal with your data, it’s definitely big data. 

If you want a number, big data usually starts at about 1 terabyte.

So if you want to learn all about how much data it takes exactly to qualify as big data, then you’re in the right place.

Let’s get started!

What Is Big Data?

Before we get into the technical definitions, let’s talk about the general concept of big data. 

You’ve probably heard the term more than a few times. 

It’s in articles for businesses all over the world. 

Just browse Forbes or Business Insider for a few minutes, and you’ll probably see more than a few mentions of big data, the Internet of Things, advanced analytics, and plenty of related topics.

What is it all really talking about? 

As the name suggests, big data is all about the collection and processing of data in very large volumes.

Generally speaking, these are data volumes that are too large to handle with a single personal computer. 

Big data also usually involves data generation from multiple locations and resources.

There’s a point to all of this. 

The idea is that when you have sufficiently large pools of information, the trends you observe are more meaningful and reliable.

Breaking Down the Idea of Large Averages

This gets into the idea of very large averages.

Having a lot of data introduces the idea of inertia into your data and analytics.

What does that mean?

Think about averaging grades for a moment. 

Whether school was a long time ago or you’re a current student, it’s pretty normal for a class to combine a bunch of different grades into a final average for the course.

So, if a college class has three test grades and nothing else, each individual grade has a huge impact on your final average. 

Doing well or poorly on just one test can completely swing your final grade.

On the other hand, if you have a homework assignment every single night, then over the course of a semester, no one homework grade is going to matter that much. 

The average has more individual data points in it, so it’s not as heavily swayed by a single grade.
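
To make the grade example concrete, here is a minimal Python sketch. The grade values (a usual 90 and one bad 50) and the function name are made up for illustration. It compares how far a single bad grade drags the average down with 3 grades versus 100.

```python
from statistics import mean

def swing_from_one_bad_grade(n_grades, usual=90.0, bad=50.0):
    """Return how many points the average drops when exactly one grade is bad.

    The grade values are illustrative assumptions, not real data.
    """
    all_good = [usual] * n_grades
    one_bad = [usual] * (n_grades - 1) + [bad]
    return mean(all_good) - mean(one_bad)

three_tests = swing_from_one_bad_grade(3)
nightly_homework = swing_from_one_bad_grade(100)

print(f"3 grades: one bad grade costs {three_tests:.1f} points")
print(f"100 grades: one bad grade costs {nightly_homework:.1f} points")
```

With three grades, the single bad score costs about 13 points. With a hundred grades, it costs less than half a point.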

Which grading system better reflects your performance in the class? 

Proponents of big data would say that the more stable average is the better one.

Now, extend this idea to large businesses. 

Imagine how Google might use data and averages to figure out how to rank searches. 

Billions of searches run through Google every day, so no one search is going to completely change how search rankings work. 

That provides a lot of stability to the number crunching, but it comes at a cost. 

How do you process the billions of searches that go through Google every day?

That’s the entire concept of big data. 

It’s figuring out how to collect and process more data so your averages are more stable and reliable.

How Does Big Data Work? (3 Points)

The answer to how much data counts as big data makes more sense once you have a working understanding of big data itself.

For that, we can break big data processes into three pillars: collection, storage, and processing.

#1 Collecting Data

The first of the three pillars is collecting data. 

In order to have big data, you need a lot of numbers (or data points) in the first place. 

There are a lot of ways to collect information, but a couple are easy to understand.

For any business, transactions are pretty normal. 

Whether your business is Walmart selling countless goods at thousands of locations every day or a law firm billing clients each month, money changes hands. 

Most businesses try to keep good track of the money they make and spend, so this is an easy way to get data. 

You can create a transaction receipt every time money changes hands, and there are a lot of modern systems that automate this process.

Another easy way to generate data is with websites.

Every time someone visits your site, they interact with it. 

Computers can track what they do, and that generates tons of data.
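
Both examples boil down to the same thing: every sale or page view becomes a small record that gets saved. Here is a rough Python sketch of that idea. The field names (`amount_cents`, `store_id`, `path`) and the helper functions are hypothetical, not taken from any real system.

```python
import json
from datetime import datetime, timezone

def make_event(kind, **details):
    """Build one data point: what happened, when, and any extra details."""
    return {
        "kind": kind,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        **details,
    }

def to_log_line(event):
    """Serialize an event as one line of JSON, ready to append to a log."""
    return json.dumps(event, sort_keys=True)

# Hypothetical examples: one store transaction and one website visit.
sale = make_event("sale", amount_cents=1999, store_id="store-17")
visit = make_event("page_view", path="/pricing")

print(to_log_line(sale))
print(to_log_line(visit))
```

One record like this is tiny. Millions of them per day are what turn ordinary bookkeeping into big data.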

Ultimately, data collection is only limited by creativity, but until you have the infrastructure in place to collect data, the rest is meaningless.

#2 Storing Data

Once you have a lot of data, it has to be stored somewhere. 

Since we’re talking about big data, it’s unlikely that you can store everything in a physical filing cabinet or even a single personal computer.

Big data usually involves the use of servers. 

Servers are powerful computer systems that are designed to handle much larger amounts of data and processing than personal devices.

So, most players in big data either build big servers or contract tech companies to manage servers for them. 

You’ll hear terms like “the cloud” thrown around. 

Ultimately, cloud services are a way to outsource server management, so it all boils down to the same root concept.

You need access to powerful servers to store your big data.

#3 Processing Data

Lastly, big data is meaningless unless you analyze it. 

Running so much information through calculations and algorithms is a challenge, so you typically need powerful processing resources to analyze your big data.

Once again, servers do most of the work. 

As I just said, servers can handle much larger processing loads than personal computers. 

That means they can run way more calculations than your smartphone or laptop, and that helps them sort through the huge data stores we’re discussing today.

When data gets big enough, you might even use multiple servers or server groups to go through it all. 

You can trust that a company like Google is expending so many resources on data that it can’t even fit in a single warehouse. 

The company currently has 23 data center locations.

Each location is filled with far more processing power than any ordinary business would ever need.

Here’s my attempt to put this in perspective.

It takes billions of gallons of water just to cool off the computers in these data centers each year. 

Needless to say, the power necessary to run the world’s most-used search engine is outright ridiculous.

How Much Data Does it Take to Qualify As Big Data?

OK.

Now that you have a better picture of what is involved with big data, let’s return to the original question.

How much data is big data?

If you ask a hundred tech experts, you might get a hundred answers.

I want to focus on just two.

The first comes from Przemek Chojecki, an Oxford Ph.D. and computer science expert. 

According to Chojecki, big data refers to any “dataset which is too large or complex for ordinary computing devices to process.” 

So, that would mean that the amount of data needed to qualify as big data changes as computers get more powerful and sophisticated.

Using this definition, by today’s standards, big data starts to kick in when it takes up more than a terabyte of storage space (I’ll get into this in a minute).

The other definition, which I can’t attribute to a single expert, is that big data applies to any situation that requires innovative solutions in order to handle it all. 

So, if you can’t process your data with the tools you already have, then you’re dealing with big data.

Both of these ideas make a lot of sense. 

If your computer (or computers) on hand can’t handle the data, then it’s big. That’s pretty easy, right?
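
If it helps to see the two definitions side by side, here is a small Python sketch. The function name is made up, and the 1 terabyte cutoff is only this article’s rule of thumb, not an official standard.

```python
TERABYTE = 10**12  # 1 trillion bytes (decimal definition)

def looks_like_big_data(dataset_bytes, capacity_bytes, rule_of_thumb=TERABYTE):
    """Apply both working definitions from this article.

    capacity_bytes is whatever your current systems can comfortably handle;
    that number is an assumption you have to supply yourself.
    """
    too_big_for_system = dataset_bytes > capacity_bytes
    past_rule_of_thumb = dataset_bytes >= rule_of_thumb
    return too_big_for_system or past_rule_of_thumb

# 500 GB on a system that can handle 2 TB: not big data by either test.
print(looks_like_big_data(500 * 10**9, 2 * TERABYTE))
# 3 TB on the same system: too big for the system, so it qualifies.
print(looks_like_big_data(3 * TERABYTE, 2 * TERABYTE))
```

Notice that the answer depends on the second argument. The same dataset can be big data for one organization and routine for another.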

But to clarify, we should probably explore a few more ideas. 

First, I’m going to explain data sizes a little more closely.

Understanding Data Sizes

If terabytes of data are what qualify as big data, then what is a terabyte? 

Well, it’s a unit of measurement for computer information. 

At the basic level, computers store information in bits.

A bit is a single one or zero, which makes it the smallest piece of information a computer can store.

So, if you’re tracking transactions, a single bit could only record a yes-or-no fact about a sale, such as whether it was paid in cash.

But, as information gets more complicated, bits don’t really do the job anymore. 

They’re still the basic building block, but you put them together to form bytes. 

More specifically, a byte is made of eight bits, which is enough to represent 256 different values, so a byte can contain a lot more data than just a bit.

Still, we’re talking about massive amounts of data here, so even a single byte doesn’t come anywhere near what you need to process big data. 

Instead, that is measured in terabytes (or even substantially larger units).

To keep it simple, a terabyte is 1 trillion bytes. 
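
Here is a short Python sketch of how these units stack up. It assumes the decimal definitions used in this article (1 terabyte = 1 trillion bytes). Some tools use binary units based on powers of 1,024 instead, so their numbers come out slightly different. The helper name is made up for illustration.

```python
BITS_PER_BYTE = 8
KILOBYTE = 10**3
MEGABYTE = 10**6
GIGABYTE = 10**9
TERABYTE = 10**12

def human_readable(num_bytes):
    """Format a byte count using the largest decimal unit that fits."""
    units = (("TB", TERABYTE), ("GB", GIGABYTE), ("MB", MEGABYTE), ("KB", KILOBYTE))
    for name, size in units:
        if num_bytes >= size:
            return f"{num_bytes / size:.2f} {name}"
    return f"{num_bytes} bytes"

print(human_readable(TERABYTE))        # 1.00 TB
print(human_readable(2_500_000_000))   # 2.50 GB
print(f"Bits in a terabyte: {TERABYTE * BITS_PER_BYTE:,}")
```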

That’s a lot of bytes, but without context, it doesn’t mean much.

You can look at it this way. 

If you have ever watched Netflix, then you’ve streamed a lot of bytes in order to view a single video. 

If you’re watching at 1080p (full high definition), then an hour of video uses up about 3 gigabytes of data.

At ultra high definition (4K), one hour of video is about 7 gigabytes.

A terabyte is 1 thousand gigabytes, so you’re looking at over 300 hours of high-definition Netflix streaming before you hit one of our definitions of big data.
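
If you want to check the math yourself, here is the same arithmetic as a tiny Python sketch. The 3 and 7 gigabytes per hour figures are the rough estimates from above, and the function name is just for illustration.

```python
GB_PER_TB = 1000  # decimal units: 1 terabyte = 1,000 gigabytes

def hours_to_reach_terabyte(gb_per_hour):
    """Hours of streaming needed to move one terabyte of data."""
    return GB_PER_TB / gb_per_hour

hd_hours = hours_to_reach_terabyte(3)   # roughly 333 hours
uhd_hours = hours_to_reach_terabyte(7)  # roughly 143 hours

print(f"HD: about {hd_hours:.0f} hours")
print(f"4K: about {uhd_hours:.0f} hours")
```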

Hopefully, that helps put it in perspective.

What Kinds of Innovations Run Big Data? (3 Things)

The second definition of big data is interesting because it forces us to look at how big data is changing the world. 

Since big data necessitates innovation, what innovations can we already see? 

I’m going to take you through three big ones. 

When we’re done, you’ll hopefully have a good idea of what it means to have so much data that it requires innovation.

#1 Artificial Intelligence

When you think about the mind-boggling amounts of data being processed, it’s clearly too much for people to do by hand. 

It’s actually too much for normal computers to do, hence our definition that says big data requires innovation.

In order to process tons of data, one of the most useful innovations is artificial intelligence. 

In particular, machine learning gets better and more accurate when it has access to more data. 

Basically, machine learning uses extremely complicated math formulas to sift through the massive piles of data we’re discussing.

With those formulas, it can simplify analysis and produce meaningful extrapolations much faster than other analytical techniques. 

The price of this is that normal computers can’t handle advanced machine learning.

It requires too much processing power.

But when you do solve the processing problem, artificial intelligence helps to sort through big data with a lot less human oversight.

#2 Decentralized Processing

If big data is too much for a single computer, then it stands to reason you could sort through it all with lots of computers, right?

That’s the concept of decentralized processing. 

This is a bit of an oversimplification, but the gist is that you can store all of the data on a server somewhere. 

You can then give a whole bunch of devices access to the data. 

Each device contributes what it can, and with enough devices, you can analyze even these huge piles of information.
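
Here is a toy Python sketch of that split-the-work idea. It is only an illustration: threads on one machine stand in for the separate devices, the "analysis" is just adding up numbers, and the function names are made up.

```python
from concurrent.futures import ThreadPoolExecutor

def split_into_chunks(data, n_chunks):
    """Cut the data into roughly equal pieces, one per worker."""
    size = max(1, -(-len(data) // n_chunks))  # ceiling division
    return [data[i:i + size] for i in range(0, len(data), size)]

def distributed_sum(data, n_workers=4):
    """Each worker totals its own chunk, then the partial totals are combined."""
    chunks = split_into_chunks(data, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partial_sums = list(pool.map(sum, chunks))
    return sum(partial_sums)

numbers = list(range(1, 1001))
print(distributed_sum(numbers))  # 500500
```

No single worker ever has to look at the whole dataset. Real systems apply the same pattern across many machines, with a lot more machinery to handle failures and move data around.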

A good example of this is blockchain. 

Blockchain requires a huge number of calculations to work.

Instead of relying on one supercomputer to do it all, blockchain allows anyone who wants to contribute to the calculations.

With enough participants, you can churn through the calculations, and the system works.

#3 Internet of Things

Another interesting innovation with big data is the Internet of Things. 

This is a catchphrase that describes systems that are designed to collect tons of data. 

So, with the Internet of Things, you might put internet-enabled sensors in refrigerators.

These sensors then report back to a central server how the refrigerators perform. 

The manufacturer can look at that data and get an idea as to what design changes they might need to make to improve on the next model.

It’s a specific example, but the idea is that with lots and lots of internet-enabled sensors, you can generate data for just about anything you want to analyze.
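
As a rough picture of what the central server does with those reports, here is a small Python sketch. The model names and temperature readings are invented for illustration, and a real system would receive them over the network rather than from a list.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical sensor reports: (refrigerator model, internal temperature in °C)
readings = [
    ("model-a", 3.9),
    ("model-a", 4.1),
    ("model-b", 6.0),
    ("model-b", 6.4),
    ("model-a", 4.0),
]

def average_by_model(readings):
    """Group the readings by model and average each group."""
    by_model = defaultdict(list)
    for model, temperature in readings:
        by_model[model].append(temperature)
    return {model: mean(temps) for model, temps in by_model.items()}

for model, avg in average_by_model(readings).items():
    print(f"{model}: average {avg:.1f} °C")
```

With five readings, this is trivial. With millions of refrigerators reporting every few minutes, you are back to the storage and processing problems from earlier.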