Tuesday, February 18, 2014

In the Era of Big Data - Sampling VS Collecting Everything

The Conundrum - Data Everywhere


We live in a world where much of what we do is tracked by both governments and companies. When online, as you browse the web, various metrics are tracked for every web page you view (including this one). In the outside world, cell phones can give companies your location; credit cards, bank accounts, and frequent shopper cards show them your spending habits.

With enough data it is possible to find correlations between seemingly unrelated topics, such as web searches and illness. Below is a YouTube video showing how Google uses search data to predict flu outbreaks in near real time.

In the past, while it was possible to gather huge amounts of data, analysis was difficult and generally required sampling: a selection of the available data would be collected and analyzed to draw inferences about the overall population. Now there are new tools that allow analysis of the entire population. So should we abandon sampling in favor of simply collecting and analyzing everything? What does using more data gain us, and at what cost?

Collecting Everything - The NSA


The NSA provides a very high-profile example of an organization compiling all available data (though exactly how it analyzes what it gets is unknown). In a recent Forbes article, Gil Press compiles material from various sources in an attempt to determine whether this approach produces better results than simply sampling from the population.

Beyond the question of whether the NSA should collect the data is the question of whether the glut of data is unmanageable.
NSA Datacenter Image - AP PHOTO/GOOGLE, CONNIE ZHOU
"The unspoken assumption here is that possessing massive quantities of data guarantees that the government will be able to find criminals, and find them quickly, by tracing their electronic tracks. That assumption is unrealistic. Massive quantities of data add cost and complexity to every kind of analysis, often with no meaningful improvement in the results. Indeed, data quality problems and slow data processing are almost certain to arise, actually hindering the work of data analysts. It is far more productive to invest resources into thoughtful analysis of modest quantities of good quality, relevant data."

On the other hand, there is a concern that sampling will miss outliers, which are really what you want to find in this sort of analysis. In a discussion post regarding big data and sampling, Paige Roberts had this to say:

"When finding outliers is the goal, sampling is counter-productive. When finding trends in the overall data is the goal, then sampling is a shortcut that may or may not do the job. But one that has become standard practice because up until now, we didn't have the data crunching power to do anything else in a sensible time frame."

So, Should We Sample or Not!?!?


It may seem unsatisfying, but in the end the answer to the question of whether or not to sample appears to be "it depends". If the sampling is done in a statistically valid way, sampling, especially for common trends, seems to be a very cost-effective method to get what we want. If we're looking for specific outliers (as opposed to just estimating how many there would be in a given sample), however, there might not be a good alternative to combing through much more data. Even when outliers are being sought, it is important to note that the amount of data in the population has to align with the processing power and tools available to analyze it. A thorough reading of the NSA case suggests the organization might have bitten off more than it can properly chew.
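
To make the trade-off a bit more concrete, here is a rough sketch in Python. Everything in it is made up for illustration: the population size, the sample size, the "normal" values around 50, and the five planted extreme records are all assumptions, not data from any of the sources above. It takes a simple random sample and compares two things: how well the sample reproduces an overall trend (the median), and how many of the specific outlier records the sample actually contains compared with a scan of everything.

```python
import random
import statistics


def simulate(population_size=100_000, sample_size=1_000, n_outliers=5, seed=42):
    """Compare a simple random sample against a full scan (illustrative only)."""
    rng = random.Random(seed)

    # Ordinary records: hypothetical values clustered around 50.
    population = [rng.gauss(50, 10) for _ in range(population_size)]

    # Plant a handful of extreme records at random positions.
    outlier_ids = set(rng.sample(range(population_size), n_outliers))
    for i in outlier_ids:
        population[i] = 10_000.0

    # Simple random sample without replacement.
    sample_ids = rng.sample(range(population_size), sample_size)
    sample = [population[i] for i in sample_ids]

    return {
        # Trend: the sample median should land close to the population median.
        "population_median": statistics.median(population),
        "sample_median": statistics.median(sample),
        # Specific outliers: the sample only sees the ones it happened to draw.
        "outliers_in_sample": len(outlier_ids.intersection(sample_ids)),
        # A full scan checks every record, so it sees all of them.
        "outliers_in_full_scan": sum(1 for v in population if v > 1_000),
    }


if __name__ == "__main__":
    for name, value in simulate().items():
        print(f"{name}: {value}")
```

With these assumed numbers, any one specific record has a 1,000 in 100,000 (1%) chance of landing in the sample, so on average the sample would contain only about 0.05 of the five planted outliers, while the median is estimated from just 1% of the data. The full scan finds every outlier but has to touch all 100,000 records to do it.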

So, should you sample? I'll leave that decision up to you.

__________________________

References


In the order they appear in-article:

- http://www.forbes.com/sites/gilpress/2013/06/12/the-effectiveness-of-small-vs-big-data-is-where-the-nsa-debate-should-start/

- http://global.fncstatic.com/static/managed/img/Scitech/NSA%20Phone%20Records%202.jpg / http://www.foxnews.com/tech/2013/06/11/inside-nsas-secret-utah-data-center/

- http://www.techrepublic.com/blog/big-data-analytics/why-samples-sizes-are-key-to-predictive-data-analytics/

- http://smartdatacollective.com/users/paige-roberts