When to use Hadoop, HBase, Hive and Pig?
Asked by Khalefa, 2012-12-17T17:33:35

When to use Hadoop, HBase, Hive and Pig?

What are the benefits of using either Hadoop, HBase or Hive?

From my understanding, HBase avoids using MapReduce and provides column-oriented storage on top of HDFS. Hive is an SQL-like interface for Hadoop and HBase.

I would also like to know how Hive compares with Pig.

Copyright Notice: Content author 「Khalefa」, reproduced under the CC BY-SA 4.0 license with a link to the original source and this disclaimer.
Link to original article:https://stackoverflow.com/questions/13911501/when-to-use-hadoop-hbase-hive-and-pig

Answers
Pankti 2017-11-29T19:08:39

Understanding in depth

Hadoop

Hadoop is an open-source project of the Apache Software Foundation. It is a framework written in Java, originally developed by Doug Cutting in 2005. It was created to support distribution for Nutch, the text search engine. Hadoop uses Google's MapReduce and Google File System technologies as its foundation.

Features of Hadoop

- It is optimized to handle massive quantities of structured, semi-structured and unstructured data using commodity hardware.
- It has a shared-nothing architecture.
- It replicates its data across multiple computers so that if one goes down, the data can still be processed from another machine that stores a replica.
- Hadoop is for high throughput rather than low latency. It is a batch operation handling massive quantities of data; therefore the response time is not immediate.
- It complements Online Transaction Processing and Online Analytical Processing. However, it is not a replacement for an RDBMS.
- It is not good when work cannot be parallelized or when there are dependencies within the data.
- It is not good for processing small files. It works best with huge data files and data sets.

Versions of Hadoop

There are two versions of Hadoop available:

- Hadoop 1.0
- Hadoop 2.0

Hadoop 1.0

It has two main parts:

1. Data Storage Framework

This is a general-purpose file system called the Hadoop Distributed File System (HDFS). HDFS is schema-less: it simply stores data files, and these data files can be in just about any format. The idea is to store files as close to their original form as possible. This in turn gives the business units and the organization much-needed flexibility and agility without being overly worried about what can be implemented.

2. Data Processing Framework

This is a simple functional programming model, initially popularized by Google as MapReduce. It essentially uses two functions, MAP and REDUCE, to process data. The "Mappers" take in a set of key-value pairs and generate intermediate data (which is another list of key-value pairs). The "Reducers" then act on this input to produce the output data. The two functions work in isolation from one another, enabling the processing to be highly distributed in a highly parallel, fault-tolerant and scalable way.

Limitations of Hadoop 1.0

- The first limitation was the requirement of MapReduce programming expertise.
- It supported only batch processing, which is suitable for tasks such as log analysis and large-scale data mining projects, but pretty much unsuitable for other kinds of projects.
- One major limitation was that Hadoop 1.0 was computationally tightly coupled with MapReduce, which meant that the established data management vendors were left with two options: either rewrite their functionality in MapReduce so that it could be executed in Hadoop, or extract data from HDFS and process it outside of Hadoop. Neither option was viable, as both led to process inefficiencies caused by data being moved in and out of the Hadoop cluster.

Hadoop 2.0

In Hadoop 2.0, HDFS continues to be the data storage framework. However, a new and separate resource management framework called Yet Another Resource Negotiator (YARN) has been added. Any application capable of dividing itself into parallel tasks is supported by YARN. YARN coordinates the allocation of subtasks of the submitted application, thereby further enhancing the flexibility, scalability and efficiency of applications. It works by having an ApplicationMaster in place of the Job Tracker, running applications on resources governed by a new NodeManager. The ApplicationMaster is able to run any application, not just MapReduce. This means it supports not only batch processing but also real-time processing; MapReduce is no longer the only data processing option.

Advantages of Hadoop

It stores data in its native form. There is no structure imposed while keying in or storing data; HDFS is schema-less. It is only later, when the data needs to be processed, that structure is imposed on the raw data.

It is scalable. Hadoop can store and distribute very large datasets across hundreds of inexpensive servers that operate in parallel.

It is resilient to failure. Hadoop is fault-tolerant. It practices replication of data diligently, which means that whenever data is sent to any node, the same data also gets replicated to other nodes in the cluster, thereby ensuring that in the event of node failure there will always be another copy of the data available for use.

It is flexible. One of the key advantages of Hadoop is that it can work with any kind of data: structured, unstructured or semi-structured. Also, processing is extremely fast in Hadoop owing to the "move code to data" paradigm.

Hadoop Ecosystem

Following are the components of the Hadoop ecosystem:

HDFS: Hadoop Distributed File System. It simply stores data files as close to the original form as possible.

HBase: It is Hadoop's database and compares well with an RDBMS. It supports structured data storage for large tables.

Hive: It enables analysis of large datasets using a language very similar to standard ANSI SQL, which implies that anyone familiar with SQL should be able to access data on a Hadoop cluster.

Pig: It is an easy-to-understand dataflow language, helpful for the kind of large-dataset analysis Hadoop is built for. Pig scripts are automatically converted to MapReduce jobs by the Pig interpreter.

ZooKeeper: It is a coordination service for distributed applications.

Oozie: It is a workflow scheduler system to manage Apache Hadoop jobs.

Mahout: It is a scalable machine learning and data mining library.

Chukwa: It is a data collection system for managing large distributed systems.

Sqoop: It is used to transfer bulk data between Hadoop and structured data stores such as relational databases.

Ambari: It is a web-based tool for provisioning, managing and monitoring Hadoop clusters.

Hive

Hive is a data warehouse infrastructure tool for processing structured data in Hadoop. It resides on top of Hadoop to summarize Big Data and makes querying and analyzing easy.

Hive is not

- A relational database
- A design for Online Transaction Processing (OLTP)
- A language for real-time queries and row-level updates

Features of Hive

- It stores the schema in a database and the processed data in HDFS.
- It is designed for OLAP.
- It provides an SQL-type language for querying, called HiveQL or HQL.
- It is familiar, fast, scalable and extensible.

Hive Architecture

The following components are contained in the Hive architecture:

- User Interface: Hive is a data warehouse infrastructure that can create interaction between user and HDFS. The user interfaces that Hive supports are Hive Web UI, the Hive command line and Hive HDInsight (on Windows Server).
- MetaStore: Hive chooses the respective database servers to store the schema, or metadata, of tables, databases, columns in a table, their data types and the HDFS mapping.
- HiveQL Process Engine: HiveQL is similar to SQL for querying the schema info in the Metastore. It is one of the replacements for the traditional approach of writing a MapReduce program: instead of writing MapReduce in Java, we can write a query and have it processed for us.
- Execution Engine: The conjunction of the HiveQL process engine and MapReduce is the Hive execution engine. The execution engine processes the query and generates results the same as MapReduce would. It uses the flavor of MapReduce.
- HDFS or HBase: the Hadoop Distributed File System or HBase are the data storage techniques used to store the data in the file system.
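To make the Mapper/Reducer contract described above concrete, here is a minimal sketch of the classic word-count job against the Hadoop 2.x `org.apache.hadoop.mapreduce` API; the input and output paths are assumed to be passed on the command line:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

  // Mapper: consumes (offset, line) pairs and emits intermediate (word, 1) pairs.
  public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
    private final static IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString());
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(word, ONE);
      }
    }
  }

  // Reducer: receives (word, [1, 1, ...]) and emits (word, total).
  public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    private final IntWritable result = new IntWritable();

    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) sum += v.get();
      result.set(sum);
      context.write(key, result);
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCount.class);
    job.setMapperClass(TokenizerMapper.class);
    job.setCombinerClass(IntSumReducer.class);
    job.setReducerClass(IntSumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

The framework shuffles the mappers' (word, 1) pairs so that each reducer sees one word with all its counts; that shuffle boundary is exactly the isolation between the two functions that makes the processing parallel and fault-tolerant.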


user11768920 2019-08-11T19:43:08

I believe this thread hasn't done particular justice to HBase and Pig. While I believe Hadoop is the choice of distributed, resilient file system for big-data-lake implementations, the choice between HBase and Hive is well segregated.

A lot of use cases have a particular requirement for SQL-like or NoSQL-like interfaces. With Phoenix on top of HBase, SQL-like capability is certainly achievable; however, the performance, third-party integrations and dashboard updates are painful experiences. Still, HBase is an excellent choice for databases requiring horizontal scaling.

Pig is particularly excellent for non-recursive, batch-like computations and ETL pipelining (somewhere it outperforms Spark by a comfortable distance). Also, its high-level dataflow implementation is an excellent choice for batch querying and scripting. The choice between Pig and Hive also pivots on the need for client- or server-side scripting, the required file formats, etc. Pig supports the Avro file format, which was not true of Hive at the time. The choice of a procedural dataflow language vs. a declarative dataflow language is also a strong argument in the choice between Pig and Hive.


shazin 2012-12-17T10:27:32

For a comparison between Hadoop and Cassandra/HBase, read this post.

Basically, HBase enables really fast reads and writes with scalability. How fast and scalable? Facebook uses it to manage its user statuses, photos, chat messages, etc. HBase is so fast that Facebook has even developed stacks that use HBase as the data store for Hive itself.

Hive, on the other hand, is more like a data warehousing solution. You can use a syntax similar to SQL to query Hive contents, which is executed as a MapReduce job. Not ideal for fast, transactional systems.
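To make "fast reads and writes by key" concrete, here is a minimal sketch against the HBase Java client API (1.x style). The table name `user_status` and column family `s` are illustrative assumptions: the table must already exist, and `hbase-site.xml` is assumed to be on the classpath.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBaseKeyValueDemo {
  public static void main(String[] args) throws Exception {
    // Reads cluster settings (ZooKeeper quorum, etc.) from hbase-site.xml.
    Configuration conf = HBaseConfiguration.create();
    try (Connection conn = ConnectionFactory.createConnection(conf);
         Table table = conn.getTable(TableName.valueOf("user_status"))) {

      // Write: one row keyed by user id, one column in family "s".
      Put put = new Put(Bytes.toBytes("user42"));
      put.addColumn(Bytes.toBytes("s"), Bytes.toBytes("status"), Bytes.toBytes("online"));
      table.put(put);

      // Read it back by key: a point lookup, no MapReduce job involved.
      Get get = new Get(Bytes.toBytes("user42"));
      Result r = table.get(get);
      System.out.println(Bytes.toString(r.getValue(Bytes.toBytes("s"), Bytes.toBytes("status"))));
    }
  }
}
```

Both the put and the get address a single row by key, which is exactly the access pattern HBase is optimized for, and exactly what a Hive/MapReduce scan is not.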


nixxo_raa 2020-01-25T12:54:50

Hadoop:

Hadoop is a distributed computing framework; HDFS (the Hadoop Distributed File System) is its storage layer, and MapReduce is its computational processing model.

HBase:

HBase is key-value storage, good for reading and writing in near real time.

Hive:

Hive is used for data extraction from HDFS using SQL-like syntax. Hive uses the HQL language.

Pig:

Pig is a dataflow language for creating ETL. It's a scripting language.


akshat thakar 2015-01-16T07:31:43

I worked on a Lambda architecture processing real-time and batch loads. Real-time processing is needed where fast decisions must be taken, e.g. on a fire alarm sent by a sensor or for fraud detection on banking transactions. Batch processing is needed to summarize data that can be fed into BI systems.

We used Hadoop ecosystem technologies for the above applications.

Real-time processing

Apache Storm: stream data processing, rule application

HBase: datastore for serving the real-time dashboard

Batch processing

Hadoop: crunching huge chunks of data. A 360-degree overview, or adding context to events. Interfaces or frameworks like Pig, MR, Spark, Hive and Shark help in computing. This layer needs a scheduler, for which Oozie is a good option.

Event handling layer

Apache Kafka was the first layer, consuming high-velocity events from sensors. Kafka serves both the real-time and the batch analytics data flow, through LinkedIn connectors.


user4058730 2015-07-26T13:45:04

First of all, we should be clear that Hadoop was created as a faster alternative to the RDBMS: to process large amounts of data at a very fast rate, which earlier took a lot of time in an RDBMS.

Now one should know two terms:

Structured data: This is the data we used in traditional RDBMSs, divided into well-defined structures.

Unstructured data: This is important to understand; about 80% of the world's data is unstructured or semi-structured. This is data in its raw form that cannot be processed using an RDBMS. Example: Facebook and Twitter data. (http://www.dummies.com/how-to/content/unstructured-data-in-a-big-data-environment.html)

So, a large amount of data was being generated in the last few years, and the data was mostly unstructured; that gave birth to HADOOP. It was mainly used for very large amounts of data that would take an unfeasible amount of time with an RDBMS. It had many drawbacks, e.g. it could not be used for comparatively small data in real time, but they have managed to remove those drawbacks in the newer versions.

Before going further, I would like to note that a new Big Data tool is usually created when a fault is seen in the previous tools. So, whichever tool you see was created to overcome the problems of the previous tools.

Hadoop can be simply said to be two things: MapReduce and HDFS. MapReduce is where the processing takes place and HDFS is the storage layer where the data is kept. This structure follows the WORM principle, i.e. write once, read multiple times. So, once we have stored data in HDFS, we cannot make changes. This led to the creation of HBASE, a NoSQL product where we can make changes to the data even after writing it once.

But with time we saw that Hadoop had many faults, and to address them we created different environments over the Hadoop structure. PIG and HIVE are two popular examples.

HIVE was created for people with an SQL background. The queries written are similar to SQL, and the language is named HiveQL. HIVE was developed to process completely structured data; it is not used for unstructured data.

PIG, on the other hand, has its own query language, i.e. PIG LATIN. It can be used for both structured as well as unstructured data.

Moving on to the difference, as to when to use HIVE and when to use PIG, I don't think anyone other than the architect of PIG could say. Follow the link:
https://developer.yahoo.com/blogs/hadoop/comparing-pig-latin-sql-constructing-data-processing-pipelines-444.html


Sandeep Giri 2016-05-31T06:48:41

Let me try to answer in a few words.

Hadoop is an ecosystem which comprises all the other tools. So, you can't compare Hadoop, but you can compare MapReduce.

Here are my few cents:

Hive: If your need is very SQL-ish, meaning your problem statement can be catered for by SQL, then the easiest thing to do would be to use Hive. The other case, where you would use Hive, is when you want a server to have a certain structure of data.

Pig: If you are comfortable with Pig Latin and what you need is more of a data pipeline. Also, your data lacks structure. In those cases, you could use Pig. Honestly, there is not much difference between Hive and Pig with respect to the use cases.

MapReduce: If your problem cannot be solved by using SQL directly, you should first try to create a UDF for Hive or Pig (see the sketch below), and then, if the UDF does not solve the problem, getting it done via MapReduce makes sense.
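As a minimal sketch of that UDF route, here is a Hive UDF in Java using the classic `org.apache.hadoop.hive.ql.exec.UDF` interface (Hive 1.x era; newer Hive prefers GenericUDF). The class, jar and function names are illustrative assumptions, not part of any answer above.

```java
import org.apache.hadoop.hive.ql.exec.UDF;
import org.apache.hadoop.io.Text;

// A trivial scalar UDF that trims and lower-cases a string column.
public final class NormalizeName extends UDF {
  // Hive resolves the method named "evaluate" by reflection.
  public Text evaluate(Text input) {
    if (input == null) return null; // preserve SQL NULL semantics
    return new Text(input.toString().trim().toLowerCase());
  }
}
```

Registered with `ADD JAR my-udfs.jar;` and `CREATE TEMPORARY FUNCTION normalize_name AS 'NormalizeName';`, it can then be used inline in any HiveQL query, e.g. `SELECT normalize_name(name) FROM users;`.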


y durga prasad 2017-07-10T09:07:02

Pig: better for handling files and cleaning data. Example: removing null values, string handling, unnecessary values.
Hive: for querying the cleaned data.
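A hedged sketch of such a cleaning step, with the Pig Latin embedded from Java via `PigServer` (the input/output paths and field names are hypothetical; local mode keeps the sketch self-contained):

```java
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;

public class CleanLogs {
  public static void main(String[] args) throws Exception {
    // Run Pig Latin from Java; local mode avoids needing a cluster.
    PigServer pig = new PigServer(ExecType.LOCAL);

    // Load raw CSV records with a declared schema.
    pig.registerQuery("raw = LOAD 'input/users.csv' USING PigStorage(',') "
        + "AS (id:chararray, name:chararray, age:int);");

    // Drop rows with null ids or empty names -- the "cleaning" step.
    pig.registerQuery("clean = FILTER raw BY (id IS NOT NULL) AND (TRIM(name) != '');");

    // Normalize the name field using Pig's built-in string functions.
    pig.registerQuery("normalized = FOREACH clean GENERATE id, LOWER(TRIM(name)) AS name, age;");

    // Materialize the cleaned data for downstream querying.
    pig.store("normalized", "output/users_clean");
  }
}
```

The cleaned output can then be exposed to Hive as a table for querying, which is exactly the Pig-then-Hive split this answer suggests.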


Tariq 2012-12-17T12:56:57

MapReduce is just a computing framework. HBase has nothing to do with it. That said, you can efficiently put or fetch data to/from HBase by writing MapReduce jobs. Alternatively, you can write sequential programs using the HBase APIs, such as the Java API, to put or fetch the data. But we use Hadoop, HBase etc. to deal with gigantic amounts of data, so that doesn't make much sense: using normal sequential programs would be highly inefficient when your data is too huge.

Coming back to the first part of your question: Hadoop is basically two things, a distributed file system (HDFS) + a computation or processing framework (MapReduce). Like all other file systems, HDFS also provides us with storage, but in a fault-tolerant manner, with high throughput and a lower risk of data loss (because of the replication). But, being a file system, HDFS lacks random read and write access. This is where HBase comes into the picture. It's a distributed, scalable big data store, modelled after Google's BigTable. It stores data as key/value pairs.

Coming to Hive: it provides us with data warehousing facilities on top of an existing Hadoop cluster. Along with that, it provides an SQL-like interface which makes your work easier, in case you are coming from an SQL background. You can create tables in Hive and store data there. Along with that, you can even map your existing HBase tables to Hive and operate on them (see the sketch below).

Pig is basically a dataflow language that allows us to process enormous amounts of data very easily and quickly. Pig basically has two parts: the Pig interpreter and the language, Pig Latin. You write Pig scripts in Pig Latin and process them using the Pig interpreter. Pig makes our life a lot easier; otherwise, writing MapReduce is not always easy. In fact, in some cases it can really become a pain.

I had written an article with a short comparison of the different tools of the Hadoop ecosystem some time ago. It's not an in-depth comparison, but a short intro to each of these tools which can help you get started. (Just adding on to my answer. No self-promotion intended.)

Both Hive and Pig queries get converted into MapReduce jobs under the hood.

HTH
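A hedged sketch of that HBase-to-Hive mapping, driven through the HiveServer2 JDBC interface. The endpoint, the HBase table `user_status` and its column family `s` are illustrative assumptions; the HBase table is assumed to already exist.

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HiveOverHBase {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager
             .getConnection("jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {

      // Expose the existing HBase table "user_status" as an external Hive table.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS user_status_hive "
          + "(id STRING, status STRING) "
          + "STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler' "
          + "WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,s:status') "
          + "TBLPROPERTIES ('hbase.table.name' = 'user_status')");

      // Ordinary HiveQL now works over the HBase-resident data.
      try (ResultSet rs = stmt.executeQuery(
          "SELECT status, COUNT(*) FROM user_status_hive GROUP BY status")) {
        while (rs.next()) {
          System.out.println(rs.getString(1) + "\t" + rs.getLong(2));
        }
      }
    }
  }
}
```

The `hbase.columns.mapping` string binds the HBase row key to `id` and the `s:status` column to `status`; the aggregate query then runs as a MapReduce job over HBase, which is exactly the "operate on them" case described above.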


swaroop 2017-12-07T04:23:21

1. We use Hadoop for storing large data (i.e. structured, unstructured and semi-structured data) in file formats like txt and csv.

2. If we want columnar updates in our data, then we use the HBase tool.

3. In the case of Hive, we store Big Data that is in a structured format, and in addition to that we provide analysis on that data.

4. Pig is a tool which uses the Pig Latin language to analyze data in any format (structured, semi-structured and unstructured).


Sanjay Subramanian 2013-07-29T23:56:32

I implemented a Hive data platform recently at my firm and can speak to it in the first person since I was a one-man team.

Objective

- To have the daily web log files collected from 350+ servers queryable through some SQL-like language
- To replace daily aggregation data generated through MySQL with Hive
- To build custom reports through queries in Hive

Architecture options

I benchmarked the following options:

- Hive+HDFS
- Hive+HBase: queries were too slow, so I dumped this option

Design

- Daily log files were transported to HDFS
- MR jobs parsed these log files and output the results as files in HDFS
- Created Hive tables with partitions and locations pointing to the HDFS locations (a sketch of such a table definition follows this answer)
- Created Hive query scripts (call it HQL if you like, as distinct from SQL) that in turn ran MR jobs in the background and generated aggregation data
- Put all these steps into an Oozie workflow, scheduled with a daily Oozie coordinator

Summary

HBase is like a map. If you know the key, you can instantly get the value. But if you want to know how many integer keys in HBase are between 1000000 and 2000000, HBase alone is not suitable.

If you have data that needs to be aggregated, rolled up or analyzed across rows, then consider Hive.

Hopefully this helps.

Hive actually rocks... I know, I have lived it for 12 months now... So does HBase...
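A minimal sketch of that partitioned-table step, again via HiveServer2 JDBC. The schema, delimiter and the `/data/weblogs` paths are illustrative assumptions, not the author's actual layout:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class DailyLogTable {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager
             .getConnection("jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {

      // External table over the HDFS directory the parsing MR jobs write to;
      // dropping the table later will not delete the underlying files.
      stmt.execute("CREATE EXTERNAL TABLE IF NOT EXISTS weblogs "
          + "(ip STRING, url STRING, status INT, bytes BIGINT) "
          + "PARTITIONED BY (dt STRING) "
          + "ROW FORMAT DELIMITED FIELDS TERMINATED BY '\\t' "
          + "LOCATION '/data/weblogs'");

      // Register one day's output directory as a new partition; queries that
      // filter on dt then scan only that day's files instead of everything.
      stmt.execute("ALTER TABLE weblogs ADD IF NOT EXISTS PARTITION (dt='2013-07-29') "
          + "LOCATION '/data/weblogs/2013-07-29'");
    }
  }
}
```

In a setup like the one described, the ADD PARTITION statement would be the step an Oozie workflow runs daily after the MR parsing jobs finish.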


Akshay Sharma 2018-05-12T04:09:04

Cleansing data in Pig is very easy. A suitable approach would be to cleanse the data through Pig, then process the data through Hive, and later upload it to HDFS.


Ravindra babu 2015-10-30T10:14:28

From the official Apache website: https://hadoop.apache.org/

Hadoop is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Some more projects which are part of Hadoop:

- HBase™: A scalable, distributed database that supports structured data storage for large tables.
- Hive™: A data warehouse infrastructure that provides data summarization and ad-hoc querying.
- Pig™: A high-level data-flow language and execution framework for parallel computation.

A Hive vs. Pig comparison can be found at this SE post.

HBase won't replace MapReduce. HBase is a scalable distributed database and MapReduce is a programming model for the distributed processing of data. MapReduce may act on data in HBase during processing.

You can use Hive/HBase for structured/semi-structured data and process it with Hadoop MapReduce.

Hive should be used for analytical querying of data collected over a period of time. A few examples: calculating trends, summarizing website logs. It can't be used for real-time queries.

HBase fits real-time querying of Big Data.

Pig can be used to construct dataflows, run scheduled jobs, crunch big volumes of data, aggregate/summarize it and store it in relational database systems. It is good for ad-hoc analysis.

Hive can be used for ad-hoc data analysis, but unlike Pig it can't support all unstructured data formats.


user1326784 2018-08-02T04:06:36

Use of Hive, HBase and Pig w.r.t. my real-time experience in different projects.

Hive is used mostly for:

- Analytics purposes, where you need to do analysis on historical data
- Generating business reports based on certain columns
- Efficiently managing the data together with metadata information
- Joining tables on certain frequently used columns, using the bucketing concept (a sketch follows this answer)
- Efficient storing and querying using the partitioning concept
- Not useful for transaction/row-level operations like update, delete, etc.

Pig is mostly used for:

- Frequent data analysis on huge data
- Generating aggregated values/counts on huge data
- Generating enterprise-level key performance indicators very frequently

HBase is mostly used:

- For real-time processing of data
- For efficiently managing complex and nested schemas
- For real-time querying and faster results
- For easy scalability with columns
- Useful for transaction/row-level operations like update, delete, etc.
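A hedged sketch of that bucketing idea (the table and column names are hypothetical). Clustering both sides of a frequent join on the join key, into the same number of buckets, lets Hive use bucket map joins:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BucketedOrders {
  public static void main(String[] args) throws Exception {
    Class.forName("org.apache.hive.jdbc.HiveDriver");
    try (Connection conn = DriverManager
             .getConnection("jdbc:hive2://localhost:10000/default", "", "");
         Statement stmt = conn.createStatement()) {

      // Both tables are bucketed on the join key, customer_id.
      stmt.execute("CREATE TABLE IF NOT EXISTS orders "
          + "(order_id BIGINT, customer_id BIGINT, amount DOUBLE) "
          + "CLUSTERED BY (customer_id) INTO 32 BUCKETS STORED AS ORC");
      stmt.execute("CREATE TABLE IF NOT EXISTS customers "
          + "(customer_id BIGINT, name STRING) "
          + "CLUSTERED BY (customer_id) INTO 32 BUCKETS STORED AS ORC");

      // Pre-Hive-2.0, inserts must be told to honor the bucket spec.
      stmt.execute("SET hive.enforce.bucketing=true");
      // Ask the optimizer to exploit matching buckets when joining.
      stmt.execute("SET hive.optimize.bucketmapjoin=true");
    }
  }
}
```

With matching bucket counts on the join key, a join between `orders` and `customers` can be done bucket-by-bucket instead of shuffling the full tables, which is the efficiency gain this answer alludes to.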


David Gruzman 2012-12-17T20:47:26

Consider that you work with an RDBMS and have to select what to use: full table scans, or index access, but only one of them.
If you select full table scans, use Hive. If index access, HBase.


Ajay Ahuja 2019-06-24T15:43:45

The short answer to this question is:

Hadoop: a framework which facilitates a distributed file system and a programming model, allowing us to store humongous amounts of data and process them in a distributed fashion very efficiently, with much less processing time compared to traditional approaches. (HDFS: Hadoop Distributed File System. MapReduce: a programming model for distributed processing.)

Hive: a query language which allows reading/writing data to and from the Hadoop distributed file system in the very popular SQL-like fashion. This made life easier for many people from a non-programming background, as they no longer have to write MapReduce programs, except for very complex scenarios where Hive is not supported.

HBase: a columnar NoSQL database. The underlying storage layer for HBase is again HDFS. The most important use case for this database is being able to store billions of rows with millions of columns. HBase's low latency enables fast, random access to records over distributed data, which is a very important feature for complex projects like recommender engines. Also, its record-level versioning capability allows users to store transactional data very efficiently (this solves the problem of updating records that we have with HDFS and Hive).

Hope this is helpful to quickly understand the above three features.

