So what are the major issues with big data? Storage, management, and analysis. “Big Data results in three basic challenges: storing, processing and managing it efficiently. Scale-out architectures have been developed to store large amounts of data and purpose-built appliances have improved the processing capability. The next frontier is learning how to manage Big Data throughout its entire lifecycle.”1
The first problem is storage. With such large volumes, companies are looking for alternative ways to store all the data they collect. Off-site data warehouses and scale-out architectures have been adopted and improved to provide greater storage capacity. However, many firms skip the first step: reducing redundant data. As multiple departments use, process, and add to the same data, redundancies accumulate at an alarming rate. The first step in managing large data is to reduce it to its unique set, which is easier to store. “Data redundancy is costly to address as it requires additional storage, synchronization between databases, and design work to align the information represented by different presentation of the same data.”2
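To make "reducing data to its unique set" concrete, here is a minimal Python sketch. It assumes records arrive as simple dictionaries from two hypothetical departments (sales and marketing), and that two records count as duplicates when their fields match after trimming whitespace and ignoring case. The field names, sample values, and matching rule are all illustrative assumptions; real deduplication usually needs rules tuned to the data.

```python
import hashlib


def record_key(record):
    """Build a stable fingerprint from the record's normalized field values."""
    normalized = "|".join(
        f"{field}={str(value).strip().lower()}"
        for field, value in sorted(record.items())
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def unique_records(*sources):
    """Merge several record lists, keeping the first copy of each record."""
    seen = set()
    unique = []
    for source in sources:
        for record in source:
            key = record_key(record)
            if key not in seen:
                seen.add(key)
                unique.append(record)
    return unique


# Hypothetical sample data: the same customer entered by two departments.
sales = [
    {"customer": "Acme Corp", "email": "ops@acme.example"},
    {"customer": "Globex", "email": "it@globex.example"},
]
marketing = [
    {"customer": "acme corp ", "email": "OPS@acme.example"},
    {"customer": "Initech", "email": "hello@initech.example"},
]

deduplicated = unique_records(sales, marketing)
print(len(deduplicated))  # 3 unique customers out of 4 stored records
```

Hashing the normalized fields keeps the "seen" set small even when individual records are large, which is one reasonable design choice among several.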
Next, to better manage this unique set of data, firms should explore virtualization technology and applications. What exactly is virtualization? “Virtualization, in computing, is the creation of a virtual (rather than actual) version of something, such as a hardware platform, operating system (OS), storage device, or network resources”3. Virtualization allows multiple users, devices, and interfaces to access, reuse, add to, or modify the same data, which is stored once on a single independent device, thus minimizing the challenges and costs created by unnecessary data redundancy. This will result in better management going forward and help ensure the unique data set stays at a manageable and useful size.
If you're interested in virtualization and virtualization software and systems, there are a lot of great sources and companies out there. One of the leaders right now is VMware. Check out their site; it's well worth your time. http://www.vmware.com
Once data has been reduced to a unique set and measures have been taken to ensure that large sets of redundant data don’t develop and persist, productive analysis can begin, and that analysis can lead to real business intelligence. Web analytics opportunities can truly be wasted if these first steps are not taken.
It’s important not to confuse purposeful redundancy with wasteful redundancy. “Some IT redundancy is a good thing. For example, during a power outage when one of your data centers is not operational, you need a backup. The discussion here focuses on needless IT redundancy, or IT redundancy that only exists because of insufficient management of the IT systems.”4 You should back up and protect your data in case of a catastrophic failure somewhere in your information infrastructure. Wasteful redundancy, however, simply multiplies data needlessly, increases costs, and skews analytical results, thus decreasing the value of the business intelligence generated.
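A small Python sketch shows how wasteful redundancy can skew analytical results. The order IDs and amounts below are made-up illustrative values, and the scenario (one large order re-entered twice by another department) is an assumption chosen only to show the effect on a simple average.

```python
from statistics import mean

# Hypothetical order data as (order_id, amount) pairs.
orders = [
    ("A-100", 120.0),
    ("A-101", 80.0),
    ("A-102", 400.0),
]

# Assumed scenario: another department re-entered order A-102 twice.
with_duplicates = orders + [("A-102", 400.0), ("A-102", 400.0)]


def average_order_value(rows):
    """Average the amount column of (order_id, amount) rows."""
    return mean(amount for _, amount in rows)


def deduplicate(rows):
    """Drop exact duplicate rows while preserving their original order."""
    return list(dict.fromkeys(rows))


skewed = average_order_value(with_duplicates)
clean = average_order_value(deduplicate(with_duplicates))

print(skewed)  # 280.0, inflated by the repeated large order
print(clean)   # 200.0, the average over unique orders
```

In this sketch the duplicated rows raise the average order value by 40 percent, even though no new business took place.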
1: Savitz, Eric. “Best Practices for Managing Big Data”. Forbes, published July 5, 2012
2: “Redundant Data”. www.learn.geekinterview.com, published December 9, 2009
4: “Managing Meta Data for the Business”. http://www.eiminstitute.org/library/eimi-archives/volume-1-issue-10-december-2007-edition/managing-meta-data-for-the-business-reducing-it-redundancy-part-2-of-5, published December 2010