Wednesday, February 13, 2013

Introduction to Hadoop

What is Hadoop?

If we have a look at the definition: Apache Hadoop is an open-source software framework that supports data-intensive distributed applications, licensed under the Apache v2 license.[1]
In general terms, Hadoop is software that helps you derive information from both structured and unstructured data. Consider the following scenario: you have one gigabyte of data to process, stored on your desktop computer. You have no problem storing and processing this data. Then your company starts to grow very quickly and the data increases to 100 gigabytes. You start to reach the limits of your computer, so you scale up by investing in a larger one. Again your data grows, and this time it reaches 100 terabytes; once again you have reached the limits of your computer. Moreover, you are now asked to feed your application with unstructured data coming from sources like Facebook, Twitter, etc. What will you do now? This is where Hadoop can help.

Hadoop is an open source project of the Apache Foundation. It is a framework written in Java, originally developed by Doug Cutting, who named it after his son's toy elephant. It uses Google's MapReduce and file system technologies as its foundation.

What is Hadoop for?

Hadoop is optimized to handle massive quantities of data, whether structured, semi-structured, or unstructured, using commodity hardware, that is, relatively inexpensive computers. That massive processing is done with great performance; however, it is a batch operation, so the response time is not immediate.

Hadoop can analyze and access the unprecedented volumes of data churned out by the Internet in a much easier and cheaper way. It can map information spread across thousands of cheap computers and provides an easier means of writing analytical queries. Engineers no longer have to solve a grand computer science challenge every time they want to dig into data. Instead, they simply ask a question.

A few examples of where Hadoop can help:
Hadoop applies to a bunch of markets. In finance, if you want to do accurate portfolio evaluation and risk analysis, you can build sophisticated models that are hard to jam into a database engine. But Hadoop can handle it. In online retail, if you want to deliver better search answers to your customers so they’re more likely to buy the thing you show them, that sort of problem is well addressed by the platform Google built.[2]

Hadoop Architecture


Hadoop has two main sub-projects: MapReduce and HDFS (Hadoop Distributed File System).

MapReduce: The MapReduce algorithm is a two-stage method for processing a large amount of data. This process is abstract enough that it can accommodate many different data-processing tasks, i.e., almost any operation one might want to perform on a large set of data can be adapted to the MapReduce framework. The advantage of the MapReduce abstraction is that it permits a user to design his or her data-processing operations without concern for the inherent parallel or distributed nature of the cluster itself.
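As a rough illustration of the two-stage model, here is a minimal word-count sketch in plain Java. This is a toy simulation of the MapReduce idea, not the actual Hadoop API; the class and method names are invented for illustration:

```java
import java.util.*;

public class MapReduceSketch {
    // Map stage: turn one input line into (word, 1) pairs.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String word : line.toLowerCase().split("\\s+")) {
            if (!word.isEmpty()) pairs.add(new AbstractMap.SimpleEntry<>(word, 1));
        }
        return pairs;
    }

    // Reduce stage: combine all values grouped under one key.
    static int reduce(String word, List<Integer> counts) {
        int sum = 0;
        for (int c : counts) sum += c;
        return sum;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList("the quick brown fox", "the lazy dog");
        // Shuffle: group intermediate pairs by key (done by the framework in real Hadoop).
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (String line : input)
            for (Map.Entry<String, Integer> p : map(line))
                grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        for (Map.Entry<String, List<Integer>> e : grouped.entrySet())
            System.out.println(e.getKey() + "\t" + reduce(e.getKey(), e.getValue()));
    }
}
```

The point of the abstraction is visible here: the user writes only `map` and `reduce`; the grouping step in between is what the framework parallelizes and distributes across the cluster.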

HDFS: The Hadoop Distributed File System spans all the nodes in a Hadoop cluster for data storage. HDFS aggregates the file systems on many local nodes into one big file system. It assumes that nodes will fail, so it achieves reliability by replicating data across multiple nodes.
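To make the replication idea concrete, here is a small, self-contained Java sketch. It is a toy model, not HDFS code; the replication factor of 3 matches the HDFS default, but the placement scheme and names are invented for illustration:

```java
import java.util.*;

public class ReplicationSketch {
    static final int REPLICAS = 3; // matches the HDFS default replication factor

    // Assign each block to REPLICAS distinct nodes (round-robin, for simplicity).
    static Map<Integer, List<Integer>> placeBlocks(int blocks, int nodes) {
        Map<Integer, List<Integer>> placement = new HashMap<>();
        for (int b = 0; b < blocks; b++) {
            List<Integer> holders = new ArrayList<>();
            for (int r = 0; r < REPLICAS; r++) holders.add((b + r) % nodes);
            placement.put(b, holders);
        }
        return placement;
    }

    // A block is still readable if at least one node holding a replica is alive.
    static boolean readable(List<Integer> holders, Set<Integer> dead) {
        for (int n : holders) if (!dead.contains(n)) return true;
        return false;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> placement = placeBlocks(8, 5);
        Set<Integer> dead = new HashSet<>(Arrays.asList(2)); // node 2 fails
        for (Map.Entry<Integer, List<Integer>> e : placement.entrySet())
            System.out.println("block " + e.getKey() + " readable: "
                               + readable(e.getValue(), dead));
    }
}
```

With three replicas per block, losing any single node leaves every block readable, which is exactly the assumption HDFS is built around.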

Architecturally, the reason one is able to deal with lots of data is because Hadoop spreads it out. And the reason one is able to ask complicated computational questions is because it has got all of these processors, working in parallel, harnessed together.


Four great characteristics of Hadoop:


1. Scalable: Hadoop is scalable. New nodes can be added as needed without changing the data formats, the data loading methodology, or the applications on top.
2. Cost effective: Hadoop brings massively parallel computing to large clusters of regular servers. The result is a sizeable decrease in cost per terabyte of storage, which in turn makes analyzing all of the data affordable.
3. Flexible: It can absorb any type of data, structured or not, from any number of sources. Data from multiple sources can be joined and aggregated in arbitrary ways, enabling deeper analysis than other systems can provide.
4. Fault tolerant: When a node is lost, the system redirects its work to another location and continues processing without missing anything. All of this happens without programmers writing special code; they do not even need knowledge of the parallel processing infrastructure.
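The fault-tolerance point above can be sketched as a toy Java example. This is an invented illustration of reassigning work from a lost node, not Hadoop's actual scheduler; all names are hypothetical:

```java
import java.util.*;

public class FailoverSketch {
    // Remove a failed worker and hand its tasks to the survivors, round-robin.
    static void reassign(Map<String, List<String>> assignment, String failed) {
        List<String> orphaned = assignment.remove(failed);
        if (orphaned == null) return;
        Iterator<String> survivors = assignment.keySet().iterator();
        for (String task : orphaned) {
            if (!survivors.hasNext()) survivors = assignment.keySet().iterator();
            assignment.get(survivors.next()).add(task);
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> assignment = new LinkedHashMap<>();
        assignment.put("worker-1", new ArrayList<>(Arrays.asList("task-A", "task-B")));
        assignment.put("worker-2", new ArrayList<>(Arrays.asList("task-C")));
        assignment.put("worker-3", new ArrayList<>(Arrays.asList("task-D", "task-E")));

        reassign(assignment, "worker-2"); // worker-2 is lost; its tasks move
        System.out.println(assignment);   // every task still has a home
    }
}
```

The user's map and reduce functions never see the failure; the framework simply reruns the orphaned tasks elsewhere, which is why no special fault-handling code is required.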


What is Hadoop not for?


Hadoop is not suitable for online transaction processing workloads, where structured data is randomly accessed, as in a relational database. Nor is it suitable for online analytical processing or decision support system workloads, where structured data is randomly accessed to generate reports that provide business intelligence. Hadoop is used for Big Data. It complements online transaction processing and online analytical processing, but it is not a replacement for a relational database system.

Hadoop is not something which can solve all the application/datacentre problems. It solves some specific problems for some companies and organisations, but only after they have understood the technology and where it is appropriate. If you start using Hadoop in the belief it is a drop-in replacement for your database or SAN filesystem, you will be disappointed.


Recommendations by Ted Dunning: How to start using Hadoop [3]

“The best way to do it is to get in touch with the communities, and the best way always is in person. If you live near a big city there is almost certainly a big data or Hadoop MeetUp that meets roughly monthly. And then there are specialized things like data mining groups that also meet roughly monthly. If you can get to one of those and talk to people there, then you start to hear what they say about how it works, and you can see how you can try these things out. Amazon makes publicly available a number of completely free large data resources: they have large sets of genomes, they have US Census data, Apache archives of email from years and years of sample data sets that you can try all kinds of experiments on. I hate to be recommending any commercial service, but Amazon makes MapReduce and Hadoop available in small doses with their EMR product, so you can write and run small EMR programs against publicly available data really quite easily.

You can then also run a Hadoop cluster. You can run it in the cloud small clusters, it just takes a few dollars to run a significant amount of hardware for a reasonable amount of time, ten hours, two hours, whatever it takes to learn that. You can also then run on cast off hardware and that’s how I started .We had ten machines at work that were so unreliable, that nobody would use them, and so they were fine for me to experiment with. These sorts of resources are often available if you don’t want to spend a little bit of money for the EC2 instances. So that’s the implementation of the very simple specification of just do it: find people, find mailing lists, and try out on real data. Those people will give you pointers and I think I would be happy to help people as well.”