Synopses & Reviews
Need to move a relational database application to Hadoop? This comprehensive guide introduces you to Apache Hive, Hadoop's data warehouse infrastructure. You'll quickly learn how to use Hive's SQL dialect, HiveQL, to summarize, query, and analyze large datasets stored in Hadoop's distributed filesystem.
This example-driven guide shows you how to set up and configure Hive in your environment, provides a detailed overview of Hadoop and MapReduce, and demonstrates how Hive works within the Hadoop ecosystem. You'll also find real-world case studies that describe how companies have used Hive to solve unique problems involving petabytes of data.
- Use Hive to create, alter, and drop databases, tables, views, functions, and indexes
- Customize data formats and storage options, from files to external databases
- Load and extract data from tables—and use queries, grouping, filtering, joining, and other conventional query methods
- Learn best practices for creating user-defined functions (UDFs)
- Learn Hive patterns you should use and anti-patterns you should avoid
- Integrate Hive with other data processing programs
- Use storage handlers for NoSQL databases and other datastores
- Learn the pros and cons of running Hive on Amazon's Elastic MapReduce
Synopsis
Hive makes life much easier for developers who work with stored and managed data in Hadoop clusters, such as data warehouses. With this example-driven guide, you'll learn how to use the Hive infrastructure to provide data summarization, query, and analysis, particularly with HiveQL, Hive's dialect of SQL.
You'll learn how to set up Hive in your environment and optimize its use, and how it interoperates with other tools, such as HBase. You'll also learn how to extend Hive with custom code written in Java or scripting languages. Ideal for developers with prior SQL experience, this book shows you how Hive simplifies many tasks that would be much harder to implement in the lower-level MapReduce API provided by Hadoop.
About the Author
Edward Capriolo is currently System Administrator at Media6degrees where he helps design and maintain distributed data storage systems for the internet advertising industry.
Edward is a member of the Apache Software Foundation and a committer for the Hadoop-Hive project. He has experience as a developer as well as a Linux and network administrator, and he enjoys the rich world of open source software.
Dean Wampler is a Principal Consultant at Think Big Analytics, where he specializes in "Big Data" problems and tools like Hadoop and Machine Learning. Besides Big Data, he specializes in Scala, the JVM ecosystem, JavaScript, Ruby, functional and object-oriented programming, and Agile methods. Dean is a frequent speaker at industry and academic conferences on these topics. He has a Ph.D. in Physics from the University of Washington.
Jason Rutherglen is a software architect at Think Big Analytics and specializes in Big Data, Hadoop, search, and security.
Table of Contents
Preface; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; What Brought Us to Hive?; Acknowledgments
Chapter 1: Introduction; 1.1 An Overview of Hadoop and MapReduce; 1.2 Hive in the Hadoop Ecosystem; 1.3 Java Versus Hive: The Word Count Algorithm; 1.4 What's Next
Chapter 2: Getting Started; 2.1 Installing a Preconfigured Virtual Machine; 2.2 Detailed Installation; 2.3 What Is Inside Hive?; 2.4 Starting Hive; 2.5 Configuring Your Hadoop Environment; 2.6 The Hive Command; 2.7 The Command-Line Interface
Chapter 3: Data Types and File Formats; 3.1 Primitive Data Types; 3.2 Collection Data Types; 3.3 Text File Encoding of Data Values; 3.4 Schema on Read
Chapter 4: HiveQL: Data Definition; 4.1 Databases in Hive; 4.2 Alter Database; 4.3 Creating Tables; 4.4 Partitioned, Managed Tables; 4.5 Dropping Tables; 4.6 Alter Table
Chapter 5: HiveQL: Data Manipulation; 5.1 Loading Data into Managed Tables; 5.2 Inserting Data into Tables from Queries; 5.3 Creating Tables and Loading Them in One Query; 5.4 Exporting Data
Chapter 6: HiveQL: Queries; 6.1 SELECT ... FROM Clauses; 6.2 WHERE Clauses; 6.3 GROUP BY Clauses; 6.4 JOIN Statements; 6.5 ORDER BY and SORT BY; 6.6 DISTRIBUTE BY with SORT BY; 6.7 CLUSTER BY; 6.8 Casting; 6.9 Queries that Sample Data; 6.10 UNION ALL
Chapter 7: HiveQL: Views; 7.1 Views to Reduce Query Complexity; 7.2 Views that Restrict Data Based on Conditions; 7.3 Views and Map Type for Dynamic Tables; 7.4 View Odds and Ends
Chapter 8: HiveQL: Indexes; 8.1 Creating an Index; 8.2 Rebuilding the Index; 8.3 Showing an Index; 8.4 Dropping an Index; 8.5 Implementing a Custom Index Handler
Chapter 9: Schema Design; 9.1 Table-by-Day; 9.2 Over Partitioning; 9.3 Unique Keys and Normalization; 9.4 Making Multiple Passes over the Same Data; 9.5 The Case for Partitioning Every Table; 9.6 Bucketing Table Data Storage; 9.7 Adding Columns to a Table; 9.8 Using Columnar Tables; 9.9 (Almost) Always Use Compression!
Chapter 10: Tuning; 10.1 Using EXPLAIN; 10.2 EXPLAIN EXTENDED; 10.3 Limit Tuning; 10.4 Optimized Joins; 10.5 Local Mode; 10.6 Parallel Execution; 10.7 Strict Mode; 10.8 Tuning the Number of Mappers and Reducers; 10.9 JVM Reuse; 10.10 Indexes; 10.11 Dynamic Partition Tuning; 10.12 Speculative Execution; 10.13 Single MapReduce MultiGROUP BY; 10.14 Virtual Columns
Chapter 11: Other File Formats and Compression; 11.1 Determining Installed Codecs; 11.2 Choosing a Compression Codec; 11.3 Enabling Intermediate Compression; 11.4 Final Output Compression; 11.5 Sequence Files; 11.6 Compression in Action; 11.7 Archive Partition; 11.8 Compression: Wrapping Up
Chapter 12: Developing; 12.1 Changing Log4J Properties; 12.2 Connecting a Java Debugger to Hive; 12.3 Building Hive from Source; 12.4 Setting Up Hive and Eclipse; 12.5 Hive in a Maven Project; 12.6 Unit Testing in Hive with hive_test; 12.7 The New Plugin Developer Kit
Chapter 13: Functions; 13.1 Discovering and Describing Functions; 13.2 Calling Functions; 13.3 Standard Functions; 13.4 Aggregate Functions; 13.5 Table Generating Functions; 13.6 A UDF for Finding a Zodiac Sign from a Day; 13.7 UDF Versus GenericUDF; 13.8 Permanent Functions; 13.9 User-Defined Aggregate Functions; 13.10 User-Defined Table Generating Functions; 13.11 Accessing the Distributed Cache from a UDF; 13.12 Annotations for Use with Functions; 13.13 Macros
Chapter 14: Streaming; 14.1 Identity Transformation; 14.2 Changing Types; 14.3 Projecting Transformation; 14.4 Manipulative Transformations; 14.5 Using the Distributed Cache; 14.6 Producing Multiple Rows from a Single Row; 14.7 Calculating Aggregates with Streaming; 14.8 CLUSTER BY, DISTRIBUTE BY, SORT BY; 14.9 GenericMR Tools for Streaming to Java; 14.10 Calculating Cogroups
Chapter 15: Customizing Hive File and Record Formats; 15.1 File Versus Record Formats; 15.2 Demystifying CREATE TABLE Statements; 15.3 File Formats; 15.4 Record Formats: SerDes; 15.5 CSV and TSV SerDes; 15.6 ObjectInspector; 15.7 Think Big Hive Reflection ObjectInspector; 15.8 XML UDF; 15.9 XPath-Related Functions; 15.10 JSON SerDe; 15.11 Avro Hive SerDe; 15.12 Binary Output
Chapter 16: Hive Thrift Service; 16.1 Starting the Thrift Server; 16.2 Setting Up Groovy to Connect to HiveService; 16.3 Connecting to HiveServer; 16.4 Getting Cluster Status; 16.5 Result Set Schema; 16.6 Fetching Results; 16.7 Retrieving Query Plan; 16.8 Metastore Methods; 16.9 Administrating HiveServer; 16.10 Hive ThriftMetastore
Chapter 17: Storage Handlers and NoSQL; 17.1 Storage Handler Background; 17.2 HiveStorageHandler; 17.3 HBase; 17.4 Cassandra; 17.5 DynamoDB
Chapter 18: Security; 18.1 Integration with Hadoop Security; 18.2 Authentication with Hive; 18.3 Authorization in Hive
Chapter 19: Locking; 19.1 Locking Support in Hive with Zookeeper; 19.2 Explicit, Exclusive Locks
Chapter 20: Hive Integration with Oozie; 20.1 Oozie Actions; 20.2 A Two-Query Workflow; 20.3 Oozie Web Console; 20.4 Variables in Workflows; 20.5 Capturing Output; 20.6 Capturing Output to Variables
Chapter 21: Hive and Amazon Web Services (AWS); 21.1 Why Elastic MapReduce?; 21.2 Instances; 21.3 Before You Start; 21.4 Managing Your EMR Hive Cluster; 21.5 Thrift Server on EMR Hive; 21.6 Instance Groups on EMR; 21.7 Configuring Your EMR Cluster; 21.8 Persistence and the Metastore on EMR; 21.9 HDFS and S3 on EMR Cluster; 21.10 Putting Resources, Configs, and Bootstrap Scripts on S3; 21.11 Logs on S3; 21.12 Spot Instances; 21.13 Security Groups; 21.14 EMR Versus EC2 and Apache Hive; 21.15 Wrapping Up
Chapter 22: HCatalog; 22.1 Introduction; 22.2 MapReduce; 22.3 Command Line; 22.4 Security Model; 22.5 Architecture
Chapter 23: Case Studies; 23.1 m6d.com (Media6Degrees); 23.2 Outbrain; 23.3 NASA's Jet Propulsion Laboratory; 23.4 Photobucket; 23.5 SimpleReach; 23.6 Experiences and Needs from the Customer Trenches
Glossary; References; Colophon