Synopses & Reviews
Although you don't need a large computing infrastructure to process massive amounts of data with Apache Hadoop, it can still be difficult to get started. This practical guide shows you how to quickly launch data analysis projects in the cloud by using Amazon Elastic MapReduce (EMR), the hosted Hadoop framework in Amazon Web Services (AWS).
Authors Kevin Schmidt and Christopher Phillips demonstrate best practices for using EMR and various AWS and Apache technologies by walking you through the construction of a sample MapReduce log analysis application. Using code samples and example configurations, you'll learn how to assemble the building blocks necessary to solve your biggest data analysis problems.
- Get an overview of the AWS and Apache software tools used in large-scale data analysis
- Go through the process of executing a Job Flow with a simple log analyzer
- Discover useful MapReduce patterns for filtering and analyzing data sets
- Use Apache Hive and Pig instead of Java to build a MapReduce Job Flow
- Learn the basics for using Amazon EMR to run machine learning algorithms
- Develop a project cost model for using Amazon EMR and other AWS tools
Amazon now brings the power of Hadoop to the cloud, and this book helps you take advantage of it. You'll learn how to move your data to the cloud and analyze datasets using a combination of Amazon EC2, S3, and Job Flows in Amazon EMR. Unlock the power of processing large volumes of data, paying only for what you use, with Amazon Elastic MapReduce. Programming Elastic MapReduce gets you started.
- Store large datasets in Amazon Simple Storage Service (S3)
- Run Amazon Elastic Compute Cloud (EC2) and Elastic MapReduce (EMR) Jobs to analyze large data sets
- Discover common pitfalls and advanced features for writing real-world MapReduce programs using Pig
- Examine case studies and common uses
- Learn to analyze the cost and plan your projects
About the Author
Q&A with Chris Phillips, co-author of "Programming Elastic MapReduce"
Q. What makes “Programming Elastic MapReduce” important right now?
A. Big Data and Hadoop are hot technologies right now, with many companies exploring how they can use them to benefit their business and their customers. However, the upfront investment in a large Hadoop cluster and allocating space for racks of servers in a traditional data center can be a great barrier to entry for organizations that want to explore the technology and learn how it can benefit their business. Amazon Elastic MapReduce eliminates this barrier, allowing organizations to explore the technology without the upfront costs and to pay only for the resources they use.
Netflix and Airbnb are among the well-known organizations that use Amazon Elastic MapReduce heavily.
Q. What do you hope that readers of your book will walk away with?
A. The hope in writing Programming Elastic MapReduce is to show readers how easy it is to build an application in Amazon EMR, and that they can start building their application today without building clusters of servers or finding the space and resources to manage a Hadoop cluster. Readers will learn the multitude of language and technology options available for building an Amazon EMR application, and can go from a development laptop to a running cloud-based cluster in minutes.
Q. What's the most exciting and important thing happening in this space currently?
A. Data Science is a rapidly growing field in which business intelligence, statistics, and computer science come together to help businesses solve new problems. According to Gartner, the market will require 100,000+ data scientists by 2020. Companies like Kaggle.com now run data science competitions to source some of the best and brightest data scientists to help companies solve their data analysis problems. We are just starting to see businesses leverage this technology, in examples like Netflix's recommendation engine. The power of this technology is only now starting to be realized, with tremendous growth ahead. Our book gives developers and programmers interested in this field a way to learn the technology and a platform to start projects with low upfront costs.
Q. Can you provide a few tips on how to get started with Elastic MapReduce?
A. 1. Move your data to AWS: Before you can start processing data with Amazon Elastic MapReduce, you will need to move your data to Amazon S3. s3cmd and the AWS Command Line Interface are two easy-to-use command-line utilities that can be used inside AWS or on individual servers in your data center to transfer data to S3, where it can be processed by Amazon EMR. For very large datasets, organizations should explore the AWS Import/Export service to send their data to Amazon on physical storage.
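As a quick sketch of that first step, here are two illustrative ways to push a directory of log files to S3. The bucket name and local path are hypothetical; substitute your own.

```shell
# Upload a local log directory to S3 with the AWS Command Line Interface.
# (Bucket name and paths are illustrative; requires configured AWS credentials.)
aws s3 cp /var/log/app/ s3://my-emr-input-bucket/logs/ --recursive

# The same transfer using s3cmd:
s3cmd put --recursive /var/log/app/ s3://my-emr-input-bucket/logs/
```

Either tool works from inside EC2 or from servers in your own data center; once the objects are in S3, they can be referenced as input paths for an EMR Job Flow.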
2. Pick the right problem to solve: When people first learn about Hadoop or Elastic MapReduce, they often think of it as similar to database technology. Elastic MapReduce is more like a batch processing system. It can ingest a large amount of data and process it faster and more efficiently than a traditional database. However, the way EMR processes this data is similar to a table scan, where all of the data is read and analyzed. EMR cannot perform as efficiently as a traditional database when retrieving a small number of rows from a large dataset. Additional technologies like Amazon Redshift and HBase can be used with Amazon EMR to get the benefits of both a traditional database and Hadoop.
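The "table scan" character of MapReduce can be seen in a minimal sketch: every record passes through the map phase whether or not it matches, and the reduce phase aggregates the results. The log lines and severity-count logic below are illustrative, not taken from the book's sample application.

```python
# Minimal map/reduce sketch: the entire input is scanned,
# which is why EMR behaves like batch processing, not a database lookup.
from collections import defaultdict

def map_phase(lines):
    """Emit a (severity, 1) pair for every log line -- the whole
    data set is read, just as an EMR job scans all its S3 input."""
    for line in lines:
        severity = line.split()[0]
        yield severity, 1

def reduce_phase(pairs):
    """Sum the counts per key, as a reducer would."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

logs = [
    "INFO service started",
    "ERROR disk full",
    "INFO request handled",
    "ERROR timeout",
]
print(reduce_phase(map_phase(logs)))  # counts per severity level
```

Even though only the ERROR lines may be of interest, the mapper still touches every record; that full-scan cost is what makes EMR a poor fit for small-row lookups and a good fit for bulk analysis.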
3. Save money using spot instances: Amazon EMR's latest console, released in November 2013, allows a user to resize a cluster quickly. A cost-effective way of processing data in EMR is to start, or increase the size of, a running cluster with a number of task nodes that use spot instances. Spot instances let you name the price you are willing to pay for additional capacity, and prices are typically far below Amazon's on-demand prices.
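The same resize can be done from the command line. The sketch below assumes the AWS CLI; the cluster ID, instance type, count, and bid price are all placeholders, and the exact shorthand syntax may vary by CLI version.

```shell
# Add a task instance group of spot instances to a running EMR cluster.
# j-XXXXXXXXXXXXX, m1.large, the count, and the bid price are illustrative.
aws emr add-instance-groups \
  --cluster-id j-XXXXXXXXXXXXX \
  --instance-groups InstanceGroupType=TASK,InstanceType=m1.large,InstanceCount=4,BidPrice=0.08
```

Because task nodes do not hold HDFS data, losing them when the spot price rises above your bid slows the job rather than corrupting it, which is what makes spot capacity a low-risk way to cut costs.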
4. Set up persistent and transient Amazon EMR clusters: An Amazon EMR cluster can be set up to terminate once the cluster completes all the steps in the Job Flow. This type of Amazon EMR cluster is considered a transient cluster, since it only lives for the life of the Job Flow it needs to complete. Alternatively, an Amazon EMR cluster can be set up to continue running and wait for additional steps; this is a persistent cluster. There are pros and cons to both of these cluster types, and the choice will depend on your application. However, a few rules of thumb may help in the selection that's right for you.
Transient clusters can be used to save money on Amazon EMR costs. If your data flow is sporadic, it may make sense to queue up data in S3 and only start an EMR cluster once a week, day, or hour, depending on your need. This allows you to save money during the times your cluster would otherwise sit idle waiting for work to arrive. You can use Amazon CloudWatch to monitor your cluster to see if your data and workloads would benefit from using transient EMR clusters. AWS Data Pipeline can help you build workflows that trigger EMR cluster creation when the right conditions exist to process data.
A persistent EMR cluster can be the right choice for your organization if the results of your data analysis are time critical or the data flow is consistent enough to necessitate constant data analysis processing. Your application and data processing will have lower processing overhead without the need to regularly build up and tear down EMR clusters.
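The two cluster types above differ by a single launch option in the AWS CLI. The sketch below is illustrative: cluster names, the AMI version, and instance choices are placeholders, and flag names may differ across CLI versions.

```shell
# Hypothetical transient cluster: runs its steps, then terminates on its own.
aws emr create-cluster \
  --name "nightly-log-analysis" \
  --ami-version 2.4 \
  --instance-type m1.large --instance-count 3 \
  --auto-terminate

# Hypothetical persistent cluster: stays running, waiting for additional steps.
aws emr create-cluster \
  --name "always-on-analysis" \
  --ami-version 2.4 \
  --instance-type m1.large --instance-count 3 \
  --no-auto-terminate
```

With `--auto-terminate` you pay only while steps run; with `--no-auto-terminate` you pay for idle time but avoid the build-up and tear-down overhead between jobs.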
5. Experiment with EMR cluster node types: Throughout the book, we typically use the smallest and fewest number of instances in an Amazon EMR cluster. This helps reduce the costs associated with learning Amazon EMR. However, your application will need much more than this when running in a production setting with real-world demands. Some applications will be more memory intensive, CPU intensive, or even disk read and write intensive. To find out what is right for your application, experiment with different instance types and numbers of instances on a small subset of your data to learn what size EMR cluster meets your data processing time and AWS cost requirements.
Table of Contents
Preface; What Is AWS?; What's in This Book?; Sign Up for AWS; Code Samples in This Book; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments; Chapter 1: Introduction to Amazon Elastic MapReduce; 1.1 Amazon Web Services Used in This Book; 1.2 Amazon Elastic MapReduce; 1.3 Amazon EMR and the Hadoop Ecosystem; 1.4 Amazon Elastic MapReduce Versus Traditional Hadoop Installs; 1.5 Application Building Blocks; Chapter 2: Data Collection and Data Analysis with AWS; 2.1 Log Analysis Application; 2.2 Log Messages as a Data Set for Analytics; 2.3 Understanding MapReduce; 2.4 Collection Stage; 2.5 Simulating Syslog Data; 2.6 Developing a MapReduce Application; 2.7 Custom JAR MapReduce Job; 2.8 Running an Amazon EMR Cluster; 2.9 Viewing Our Results; 2.10 Debugging a Job Flow; 2.11 Our Application and Real-World Uses; Chapter 3: Data Filtering Design Patterns and Scheduling Work; 3.1 Extending the Application Example; 3.2 Understanding Web Server Logs; 3.3 Finding Errors in the Web Logs Using Data Filtering; 3.4 Building Summary Counts in Data Sets; 3.5 Job Flow Scheduling; 3.6 Scheduling with AWS Data Pipeline; 3.7 Real-World Uses; Chapter 4: Data Analysis with Hive and Pig in Amazon EMR; 4.1 Amazon Job Flow Technologies; 4.2 What Is Pig?; 4.3 Utilizing Pig in Amazon EMR; 4.4 What Is Hive?; 4.5 Utilizing Hive in Amazon EMR; 4.6 Our Application with Hive and Pig; Chapter 5: Machine Learning Using EMR; 5.1 A Quick Tour of Machine Learning; 5.2 Python and EMR; 5.3 What's Next?; Chapter 6: Planning AWS Projects and Managing Costs; 6.1 Developing a Project Cost Model; 6.2 Optimizing AWS Resources to Reduce Project Costs; 6.3 Amazon Tools for Estimating Your Project Costs; Amazon Web Services Resources and Tools; Amazon AWS Online Resources; Amazon AWS Cost Estimation Tools; AWS Best Practices and Architecture; Amazon EMR Distributions; Cloud Computing, Amazon Web Services, and Their Impacts; AWS Service Delivery Models; 
Performance; Elasticity and Growth; Security; Uptime and Availability; Installation and Setup; Prerequisites; Installing Hadoop; Building MapReduce Applications; Running MapReduce Applications Locally; Installing Pig; Installing Hive; Index; Colophon;