
What problem does Hadoop solve?

Over the next few weeks, I will be reading a new book I bought from Amazon, "Hadoop: The Definitive Guide". By no means do I take any credit for the content. I hope these notes save you from writing your own, and that this blog also comes in handy when you are short on time.
 
Let's read and learn together :-)
 
Data is growing every day, whether it is an individual's data footprint or a corporation's. Data is produced not only by humans but, to a significant degree, by machines (RFID readers, server logs, etc.).
The problems are simple to state:
Problem # 1:
While the storage capacities of hard drives have increased massively over the years, access speeds (the rate at which data can be read from drives) have not kept up. It takes a long time to read all the data on a single 1TB drive, and writing is even slower.
In 1990, a 1,370MB drive could be read in about 5 minutes at a transfer speed of 4.4MB/s. Twenty years later, reading a full 1TB drive takes around two and a half hours at roughly 100MB/s.
The obvious way to reduce the time is to read from multiple disks at once. If we spread the 1TB across 100 disks, each disk holds only 1/100 of the data, and reading them in parallel brings the time down to under two minutes. Why use only 1/100 of each disk's space? Because the same set of 100 disks can host other datasets, shared in the same way.
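To spell out the arithmetic behind those numbers, here is a small back-of-the-envelope sketch in Java (my own illustration, not from the book), assuming roughly 1,000,000MB per terabyte and a sustained transfer rate of 100MB/s per disk:

public class ReadTime {
    public static void main(String[] args) {
        double totalMB = 1_000_000;   // ~1TB of data
        double mbPerSec = 100;        // sustained transfer rate of one disk

        double singleDiskSecs = totalMB / mbPerSec;    // 10,000s, the "hours" case above
        double hundredDiskSecs = singleDiskSecs / 100; // 100s, under two minutes

        System.out.printf("1 disk:    %.1f hours%n", singleDiskSecs / 3600);
        System.out.printf("100 disks: %.1f minutes%n", hundredDiskSecs / 60);
    }
}

Splitting the read across 100 disks cuts the time by a factor of 100, and that parallelism is exactly what Hadoop is built to exploit.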
Problem # 2:
Hardware failure: RAID takes the approach of building redundancy by keeping duplicate copies of data on spare disks. HDFS (the Hadoop Distributed File System) takes a slightly different approach: it keeps redundant copies (replicas) of each block on several machines, so the data stays available when a disk or node fails.
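As a concrete illustration of that replication idea (my own sketch, not from the book), the snippet below uses the Hadoop Java client to request three replicas per block; the file path and the factor of 3 are placeholder assumptions:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // dfs.replication controls how many DataNodes store a copy of each block
        conf.setInt("dfs.replication", 3);

        FileSystem fs = FileSystem.get(conf);
        // Change the replication factor of an existing file (hypothetical path)
        fs.setReplication(new Path("/data/example.txt"), (short) 3);
        fs.close();
    }
}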
Problem # 3:
Data aggregation: data lives in multiple sources and in multiple formats. MapReduce provides a programming model that abstracts the problem away from disk reads and writes, transforming it into a computation over sets of keys and values. MapReduce is a batch query processor, and the ability to run an ad hoc query against your whole dataset and get the results back in a reasonable time is transformative.
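To make the keys-and-values idea concrete, here is the canonical word-count job written against the Hadoop MapReduce Java API. Treat it as a minimal sketch for illustration rather than anything taken from the book; the input and output directories come from the command line.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper: emits a (word, 1) pair for every word in its input split
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Reducer: sums the counts for each word key
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {

        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not already exist)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

The map phase turns raw text into (word, 1) pairs, and the reduce phase sums the values for each key; the sorting, shuffling, and retrying of failed tasks in between is handled by the framework.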

How does Hadoop compare with other existing technologies?

1. Relational Database Management Systems: an RDBMS requires structured data, whereas Hadoop also handles semi-structured and unstructured data.
           | Traditional RDBMS          | MapReduce
Data size  | Gigabytes                  | Petabytes
Access     | Interactive and batch      | Batch
Updates    | Read and write many times  | Write once, read many times
Structure  | Static schema              | Dynamic schema
Integrity  | High                       | Low
Scaling    | Nonlinear                  | Linear

*Over time, however, the differences between relational databases and MapReduce systems are likely to blur, both as relational databases start incorporating some of the ideas from MapReduce (such as Aster Data's and Greenplum's databases) and, from the other direction, as higher-level query languages built on MapReduce (such as Pig and Hive) make MapReduce systems more approachable to traditional database programmers.


2. High-Performance Computing (HPC) and Grid Computing
 
HPC/Grid Computing                                                   | Hadoop
CPU-centric                                                          | Data-centric
Shared filesystem over a SAN                                         | Data locality: compute is collocated with the data, saving network bandwidth
Low-level C programming and algorithm implementation                 | High-level, abstracted programming model
Cumbersome process coordination, especially around partial failures  | Shared-nothing philosophy: tasks have no dependencies on one another


Stay tuned. Lots more notes to come :-) Happy learning.

