Insights from paper: Apache Hive: From MapReduce to Enterprise-grade Big Data Warehousing

Hemant Gupta
10 min read · Aug 23, 2024


1. Abstract

Apache Hive is an open-source relational database system for analytic big-data workloads.

The paper describes critical innovations in Apache Hive’s journey from a batch tool to a fully-fledged enterprise data warehousing system.

Apache Hive has a hybrid architecture. It combines traditional Massively Parallel Processing (MPP) techniques with big data and cloud concepts.

Hive can achieve the scale and performance required by today’s analytic applications.

The paper explores the system from four perspectives: Transactions, optimizer, runtime, and federation.

2. Introduction

Hive was introduced in 2010. At that time, its authors exposed an SQL-like interface on top of Hadoop MapReduce so that users did not have to deal with low-level implementation details.

Hive focused mainly on Extract-Transform-Load (ETL) or batch reporting workloads.

Users started migrating their data warehousing workloads from other systems to Hadoop. These workloads required a low-latency SQL engine for interactive and ad-hoc reporting, dashboarding, and other business intelligence use cases.

Many parallel efforts started to address this need; systems such as Impala and Presto emerged from them.

The Hive community concluded that the existing implementation of Hive provided an excellent foundation to support these workloads.

Hive was already designed for the following things:

  1. Large-scale reliable computation in Hadoop
  2. SQL compatibility
  3. Connectivity to other data management systems

What remained was to evolve Hive to support the new use cases.

This paper describes the following significant qualities:

  1. SQL and ACID support
  2. Optimization techniques
  3. Runtime latency
  4. Federation capabilities

3. System Architecture

The diagram below shows the high-level architecture of Hive.

Data storage: Data in Hive can be stored using any supported file format; ORC and Parquet are the most commonly used.

Hive can work with any file system compatible with the HDFS API, including all major commercial cloud object stores such as AWS S3 and Azure Blob Storage. Hive can also read and write data to other standalone processing systems, such as Druid or HBase.

Data catalog: Hive stores all information about its data sources using the Hive Metastore (HMS).

HMS is a catalog for all data queryable by Hive. It persists the information using an RDBMS.

Hive relies on DataNucleus, a Java object-relational mapping implementation, to simplify the support of multiple RDBMS at the backend.

The service is implemented using Thrift, so the HMS API can be used from multiple programming languages.

Exchangeable data processing runtime: Hive has gradually moved away from MapReduce to support more flexible processing runtimes compatible with YARN.

Currently, the most popular runtime for Hive is Tez. It provides more flexibility than MapReduce by modeling data processing as DAGs, with vertices representing application logic and edges representing data transfer.

Query server: HiveServer2 (HS2) allows users to execute SQL queries in Hive.

HS2 supports local and remote JDBC and ODBC connections. The Hive distribution includes a JDBC thin client called Beeline.

Query preparation stages in HiveServer2

The diagram above shows the stages of an SQL query in HS2.

  1. A user submits a query to HS2. The driver parses the query and generates a Calcite-based logical plan from its AST.
  2. The Calcite plan is then optimized.
  3. The plan is converted into a physical plan.
  4. HS2 executes additional optimizations on the physical plan DAG. A vectorized plan may be generated from it.
  5. The physical plan is passed to the task compiler. It breaks the operator tree into a DAG of executable tasks.
  6. The driver submits these tasks to the runtime application manager in YARN. YARN handles the execution.
  7. The driver fetches the results for the query and returns them to the user.

One important thing to note is that Hive implements an individual task compiler for each supported processing runtime, i.e., Tez, Spark, and MapReduce.

4. SQL and ACID Support

Two critical requirements for a data warehouse are SQL support and ACID semantics.

Let’s talk about both of these.

Hive has also preserved multiple features of its original query language.

An example is the PARTITIONED BY clause. It allows users to partition a table horizontally.

CREATE TABLE store_sales (
  item_sk INT, customer_sk INT, store_sk INT,
  quantity INT, list_price DECIMAL(7,2), sales_price DECIMAL(7,2))
PARTITIONED BY (sold_date_sk INT);

Hive stores the data for each set of partition values in a different directory in the file system, so it can easily skip scanning full partitions. The diagram below shows an example.

Example: Physical layout for a partitioned table

4.1 SQL support

Hive uses a nested data model.

It supports all major atomic SQL data types as well as non-atomic types such as STRUCT, ARRAY, and MAP.
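
As an illustration (a minimal sketch; the table and columns are hypothetical, not from the paper), a Hive table can mix atomic and nested types:

-- Hypothetical table combining atomic and nested Hive types
CREATE TABLE customer_profile (
  customer_sk INT,
  name STRING,
  address STRUCT<street:STRING, city:STRING, zip:STRING>,
  phone_numbers ARRAY<STRING>,
  preferences MAP<STRING, STRING>
);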

4.2 ACID implementation

Hive supports INSERT, UPDATE, DELETE, and MERGE statements and provides ACID guarantees via Snapshot Isolation.

It uses a transaction manager built on top of the HMS. At the time of paper writing, transactions could only span a single statement; multi-statement transactions were a work in progress.

Users can write to multiple tables within a single transaction using Hive multi-insert statements.
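
A minimal sketch of both capabilities (the staging_sales and sales_audit tables are hypothetical; full ACID tables in Hive are ORC tables created with the transactional table property):

-- ACID table: stored as ORC with the transactional property enabled
CREATE TABLE store_sales_acid (
  item_sk INT, customer_sk INT, quantity INT, sales_price DECIMAL(7,2))
STORED AS ORC
TBLPROPERTIES ('transactional'='true');

-- Multi-insert: a single statement (and transaction) writing to two tables
FROM staging_sales s
INSERT INTO TABLE store_sales_acid
  SELECT s.item_sk, s.customer_sk, s.quantity, s.sales_price
INSERT INTO TABLE sales_audit
  SELECT s.item_sk, current_timestamp;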

Transaction and lock management

Hive stores transaction and locking state in HMS. It uses a global transaction identifier, or TxnId, a monotonically increasing value generated by the Metastore.

Each TxnId maps to one or multiple write identifiers, or WriteIds.

A WriteId is a monotonically increasing value generated by the Metastore within a table scope.

The WriteId is stored with each record written by a transaction, and all records written by the same transaction to the same table share the same WriteId.

There can be many files sharing the same WriteId; each of these files is identified by a unique FileId.

Each record within a file is identified uniquely by a RowId field.

This way, the combination of WriteId, FileId, and RowId uniquely identifies each record in a table.

When a query is executed, HS2 obtains a logical snapshot of the data that needs to be read. The snapshot is represented by a transaction list comprising the highest allocated TxnId and the set of open and aborted transactions below it.

HS2 first generates the WriteId list from the transaction list. The readers in that scan then skip rows whose WriteId is

  1. higher than the high watermark, or
  2. part of the set of open and aborted transactions.
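
The transaction and lock state kept in HMS can be inspected from a Hive session with standard commands (output columns vary across Hive versions):

SHOW TRANSACTIONS;  -- open and aborted transactions tracked by the Metastore
SHOW LOCKS;         -- currently held locks and the tables/partitions they cover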

Data and file layout

We have seen that Hive stores data for each table and partition in a different directory.

To support concurrent read and write operations, Hive uses two kinds of directories within each table or partition: base and delta.

The files in the base store contain all valid records up to a certain WriteId.

A delta directory contains files with records within a WriteId range.

Hive keeps separate delta directories for inserted and deleted records.

Update operations are split into delete and insert operations.

An insert or delete transaction creates a delta directory with records bound to a single WriteId.

Compaction

Hive merges:

  1. Files in delta directories with other files in delta directories (minor compaction)
  2. Files in delta directories with files in base directories (major compaction)

Compaction does the following things:

  1. Decreases the number of directories and files in tables.
  2. Reduces the readers' effort to merge files at query execution time.
  3. Shortens the set of open and aborted TxnIds and WriteIds associated with each snapshot.

Compaction is triggered automatically by HS2 when certain thresholds are surpassed.
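
Compaction can also be requested manually with standard Hive DDL; a short sketch using the transactional table from the earlier example:

-- Request compaction explicitly instead of waiting for automatic thresholds
ALTER TABLE store_sales_acid COMPACT 'minor';  -- merge delta files together
ALTER TABLE store_sales_acid COMPACT 'major';  -- merge deltas into the base
SHOW COMPACTIONS;                              -- monitor compaction requests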

5. Query Optimization

5.1 Rule and cost-based optimizer

Hive applies a wide range of rule-based and cost-based optimization techniques.

Hive introduced a new plan representation and optimizer powered by Apache Calcite, a modular and extensible query optimizer with built-in elements that can be combined in different ways to build custom optimization logic.

Calcite provides two different planner engines:

  1. A cost-based planner, which triggers rewriting rules with the goal of reducing the overall expression cost.
  2. An exhaustive planner, which triggers rules exhaustively until it generates an expression that is no longer modified by any rule.
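
The plan the optimizer produces can be inspected with EXPLAIN; a small sketch against the store_sales table defined earlier (the date value is arbitrary):

-- Print the optimized plan; the filter on the partition column also
-- demonstrates partition pruning
EXPLAIN
SELECT customer_sk, SUM(sales_price) AS sum_sales
FROM store_sales
WHERE sold_date_sk = 2451911
GROUP BY customer_sk;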

5.2 Query re-optimization

Hive supports query re-optimization when certain errors are thrown during execution.

There are two independent reoptimization strategies.

  1. Overlay: It changes certain configuration parameters for all query re-executions.
  2. Reoptimize: It uses the optimizer's statistics-based estimates of intermediate result sizes; if those estimates turn out to be inaccurate at runtime, the query is re-planned.
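
Both strategies are typically enabled through configuration; a sketch assuming the property names used by recent Hive 3.x releases (verify against your version):

-- Enable re-execution and pick which strategies may be applied
SET hive.query.reexecution.enabled=true;
SET hive.query.reexecution.strategies=overlay,reoptimize;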

5.3 Query results cache

The query cache provides scalability advantages for repetitive identical queries.

Each HS2 instance keeps its own query results cache component.

This component maps a query's AST representation to an entry containing the location of the cached results and information about the snapshot of the data over which they were computed.

The query cache has a pending entry mode, which protects against a thundering herd of identical queries when data is updated and a cache miss is observed.
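
A sketch of how the cache is switched on per session, assuming the Hive 3.x property name:

-- Enable reuse of cached results for identical queries over unchanged data
SET hive.query.results.cache.enabled=true;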

5.4 Materialized views and rewriting

Apache Hive supports materialized views and automatic query rewriting based on those materializations.

The optimizer relies on Calcite to automatically produce full and partial rewritings over Select-Project-Join-Aggregate (SPJA) query expressions.

Materialized view maintenance

When new data is inserted or existing data is modified in the source tables, the materialized view becomes stale and must be refreshed.

At the time the paper was written, the user needed to manually trigger the rebuild operation for a materialized view using a REBUILD statement.

Hive attempts to rebuild a materialized view incrementally, falling back to a full rebuild when that is not possible.
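
A minimal sketch, using the TPC-DS-style column names that appear in the query later in this post (in recent Hive releases the source tables generally need to be transactional for a materialized view to be rebuilt):

-- Materialized view over an SPJA expression; the optimizer can rewrite
-- matching user queries to read from it automatically
CREATE MATERIALIZED VIEW sales_by_category AS
SELECT i_category, SUM(ss_sales_price) AS sum_sales
FROM store_sales JOIN item ON (ss_item_sk = i_item_sk)
GROUP BY i_category;

-- Manually triggered maintenance; Hive rebuilds incrementally when it can
ALTER MATERIALIZED VIEW sales_by_category REBUILD;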

5.5 Shared work optimization

Hive can identify overlapping subexpressions within the execution plan of a given query and compute them only once.
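
For instance, in the following query (a hypothetical sketch over the store_sales table defined earlier), both branches scan store_sales with the same filter; shared work optimization lets Hive execute that scan once and feed both aggregations:

-- The common scan of store_sales is computed only once
SELECT customer_sk AS grouping_key, SUM(sales_price) AS total
FROM store_sales
WHERE sold_date_sk BETWEEN 2451911 AND 2451941
GROUP BY customer_sk
UNION ALL
SELECT store_sk AS grouping_key, SUM(sales_price) AS total
FROM store_sales
WHERE sold_date_sk BETWEEN 2451911 AND 2451941
GROUP BY store_sk;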

5.6 Dynamic semijoin reduction

The semijoin reduction technique reduces the size of intermediate results during query execution.

This optimization is beneficial for star schema databases.

The following SQL query shows an example, where store_sales and store_returns are fact tables and item is a dimension table.

SELECT ss_customer_sk, SUM(ss_sales_price) AS sum_sales
FROM store_sales, store_returns, item
WHERE ss_item_sk = sr_item_sk AND
ss_ticket_number = sr_ticket_number AND
ss_item_sk = i_item_sk AND
i_category = 'Sports'
GROUP BY ss_customer_sk
ORDER BY sum_sales DESC;

Hive first evaluates the filtered subexpression over the dimension table (item restricted to the 'Sports' category); the values it produces are then used to skip reading records from the fact tables.
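
This behavior is controlled by a runtime flag when running on Tez (property name as in recent Hive releases; treat it as an assumption to verify):

-- Build runtime filters from the dimension side and use them to skip
-- reading fact-table records
SET hive.tez.dynamic.semijoin.reduction=true;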

6. Query Execution

Hive transitioned from MapReduce to Apache Tez for internal execution. It also implemented columnar storage formats and vectorized operators. These changes reduced query latency in Hive by orders of magnitude.
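
In practice these two features are controlled by standard Hive settings, for example:

-- Run on Tez instead of MapReduce and enable vectorized execution
SET hive.execution.engine=tez;
SET hive.vectorized.execution.enabled=true;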

Hive made a few more innovations to enhance performance further.

6.1 LLAP: Live Long and Process

Live Long and Process, also known as LLAP, is an optional layer.

It provides persistent multi-threaded query executors and a multi-tenant in-memory cache.

LLAP enhances the existing execution runtimes.

The data I/O, caching, and query fragment execution capabilities of LLAP are encapsulated within daemons.

Daemons are set up to run continuously on the worker nodes, which avoids process start-up overhead.

The daemons are stateless: each contains several executors to run several query fragments in parallel and a local work queue.

I/O elevator: This component of the daemon uses separate threads to offload data I/O and decompression. Data is read in batches and transformed into an internal run-length encoded (RLE) columnar format ready for vectorized processing.

Data caching: LLAP features an off-heap cache as its primary buffer pool for holding data in the I/O elevator. LLAP caches metadata and data from the input files.

Query fragment execution: LLAP daemons execute arbitrary query plan fragments containing operations such as filters, projections, data transformations, joins, partial aggregates, and sorting.
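
Assuming an LLAP cluster is available, query fragments can be directed to the daemons with session settings (property values as in recent Hive releases):

-- Run query fragments inside LLAP daemons; 'all' pushes every fragment there,
-- while 'map' and 'none' are more conservative modes
SET hive.execution.mode=llap;
SET hive.llap.execution.mode=all;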

6.2 Workload management improvements

The workload manager controls the access to LLAP resources for each query executed by Hive.

An administrator can create resource plans.

A resource plan consists of:

  1. One or more pools of resources, each with a maximum amount of resources and number of concurrent queries.
  2. Mappings, which route incoming queries to pools based on specific query properties.
  3. Triggers, which initiate an action, such as killing queries in a pool or moving queries from one pool to another, when certain conditions are met.
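
A sketch of the corresponding DDL (pool names, thresholds, the mapped user, and the trigger counter are illustrative; exact syntax varies across Hive versions):

-- Resource plan with two pools, a user mapping, and a trigger
CREATE RESOURCE PLAN daytime;
CREATE POOL daytime.bi WITH alloc_fraction = 0.8, query_parallelism = 5;
CREATE POOL daytime.etl WITH alloc_fraction = 0.2, query_parallelism = 20;
CREATE USER MAPPING 'analyst' IN daytime TO bi;
CREATE TRIGGER daytime.slow_query WHEN elapsed_time_ms > 60000 DO MOVE TO etl;
ALTER POOL daytime.bi ADD TRIGGER slow_query;
ALTER RESOURCE PLAN daytime ENABLE ACTIVATE;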

7. Federated Warehouse System

Hive can act as a mediator, designed to support querying over multiple independent data management systems.

As a mediator, Hive can globally enforce access control and capture audit trails via Ranger or Sentry.

7.1 Storage handlers

Hive includes a storage handler interface that needs to be implemented for each storage engine.

A storage handler consists of:

  1. An input format
  2. An output format
  3. A SerDe (serializer and deserializer)
  4. A Metastore hook, which defines notification methods invoked as part of the transactions against HMS.

Example: Hive provides a Druid storage handler so that it can leverage Druid's efficiency for executing interactive queries.
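
As an illustration, an external Hive table backed by the Druid storage handler shipped with Hive (the table and Druid datasource names are hypothetical):

-- External table whose data lives in an existing Druid datasource
CREATE EXTERNAL TABLE druid_web_metrics
STORED BY 'org.apache.hadoop.hive.druid.DruidStorageHandler'
TBLPROPERTIES ("druid.datasource" = "web_metrics");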

7.2 Pushing computation using Calcite

Hive can leverage Calcite adapters to push complex computation to supported systems and to generate queries in the languages those systems support.

Example: Using Calcite, Hive can push operations to Druid and to multiple engines with JDBC support.

The diagram below shows a query executed over a table stored in Druid and the corresponding plan and JSON query generated by Calcite.

8. Conclusion

Apache Hive has made data loading and management simple.

It handles software and hardware failures gracefully, without expensive repair or recovery times.

Hive has expanded its utility from an ETL tool to a fully-fledged enterprise-grade data warehouse.

It has added the capabilities of transactional systems.

It has improved query latency and concurrency to levels that are good enough for interactive operation.

Hive can also be used as a relational front-end to multiple storage and data systems.

References

Meta Presto paper post

Apache Thrift

DataNucleus: JDO/JPA/REST Persistence of Java Objects

Apache Hadoop YARN: yet another resource negotiator

Apache Tez: A Unifying Framework for Modeling and Building Data Processing Applications

Apache Calcite: A Foundational Framework for Optimized Query Processing Over Heterogeneous Data Sources

Hive — A Petabyte Scale Data Warehouse Using Hadoop

Google MapReduce paper post

Impala: A Modern, Open-Source SQL Engine for Hadoop


Hemant Gupta

https://www.linkedin.com/in/hkgupta/ Working on the challenge to create insights for 100 software eng papers. #100PapersChallenge, AVP Engineering at CoinDCX