Monday 25 November 2013

Hadoop Basics

What is  Hadoop ?

Hadoop is a paradigm-shifting technology that lets you do things you could not do before – namely compile and analyze vast stores of data that your business has collected. “What would you want to analyze?” you may ask. How about customer click and/or buying patterns? How about buying recommendations? How about personalized ad targeting, or more efficient use of marketing dollars?

From a business perspective, Hadoop is often used to build deeper relationships with external customers, providing them with valuable features like recommendations, fraud detection, and social graph analysis. In-house,Hadoop is used for log analysis, data mining, image processing, extract-transform-load (ETL), network monitoring– anywhere you’d want to process gigabytes, terabytes, or petabytes of data.


Pillars of Hadoop

HDFS exists to split, distribute, and manage chunks of the overall data set, which could be a single file or a
directory full of files. These chunks of data are pre-loaded onto the worker nodes, which later process them in the MapReduce phase. By having the data local at process time, HDFS saves all of the headache and inefficiency of shuffling data back and forth across the network.
In the MapReduce phase, each worker node spins up one or more tasks (which can either be Map or Reduce).Map tasks are assigned based on data locality, if at all possible. A Map task will be assigned to the worker node where the data resides. Reduce tasks (which are optional) then typically aggregate the output of all of the dozens,hundreds, or thousands of map tasks, and produce final output.

The Map and Reduce programs are where your specific logic lies, and seasoned programmers will immediately recognize Map as a common built-in function or data type in many languages, for example,
map(function,iterable) in Python, or array_map(callback, array) in PHP. All map does is run a userdefined
function (your logic) on every element of a given array. For example, we could define a function
squareMe, which does nothing but return the square of a number. We could then pass an array of numbers to
a map call, telling it to run squareMe on each. So an input array of (2,3,4,5) would return (4,9,16,25), and our call would look like (in Python) map(“squareMe”,array(‘i’,[2,3,4,5]).

Hadoop will parse the data in HDFS into user-defined keys and values, and each key and value will then be
passed to your Mapper code. In the case of image processing, each value may be the binary contents of your image file, and your Mapper may simply run a user-defined convertToPdf function against each file. In this case, you wouldn’t even need a Reducer, as the Mappers would simply write out the PDF file to some datastore (like HDFS or S3).This is what the New York Times did when converting their archives.

Consider, however, if you wished to count the occurrences of a list of “good/bad” keywords in all customer
chat sessions, twitter feeds, public Facebook posts, and/or e-mails in order to gauge customer satisfaction. Your good list may look like happy, appreciate, “great job”, awesome, etc., while your bad list may look like unhappy, angry, mad, horrible, etc., and your total data set of all chat sessions and emails may be hundreds of GB. In this case, each Mapper would work only on a subset of that overall data, and the Reducer would be used to compile the final count, summing up outputs of all the Map tasks.

At its core, Hadoop is really that simple. It takes care of all the underlying complexity, making sure that each record is processed, that the overall job runs quickly, and that failure of any individual task (or hardware/network failure) is handled gracefully. You simply bring your Map (and optionally Reduce) logic, and Hadoop processes every record in your dataset with that logic.


Why Hadoop??
The fact that Hadoop can do all the above is not the compelling argument for it’s use. Other technologies
have been around for a long, long while which can and do address everything we’ve listed so far. What makes Hadoop shine, however, is that it performs these tasks in minutes or hours, for little or no cost versus the days or weeks and substantial costs (licensing, product, specialized hardware) of previous solutions

Hadoop does this by abstracting out all of the difficult work in analyzing large data sets, performing its work on commodity hardware, and scaling linearly. -- Add twice as many worker nodes, and your processing will generally complete 2 times faster. With datasets growing larger and larger, Hadoop has become the solitary solution businesses turn to when they need fast, reliable processing of large, growing data sets for little cost.

Where to start Learning ?

Here are five steps to start learning Hadoop

  1.     Download and install Ubuntu Linux server 32-bit
  2.     Read about Hadoop (what's Hadoop, Hadoop architecture, MapReduce, and HDFS)
  3.     Start with installing Hadoop on a single node
  4.     Do some examples (like wordcount to test how it works)
  5.     Start doing multiple nodes


References : wiki.apache.org/hadoop/

No comments:

Post a Comment