Synopses & Reviews
Hadoop: The Definitive Guide helps you harness the power of your data. Ideal for processing large datasets, the Apache Hadoop framework is an open source implementation of the MapReduce algorithm on which Google built its empire. This comprehensive resource demonstrates how to use Hadoop to build reliable, scalable, distributed systems: programmers will find details for analyzing large datasets, and administrators will learn how to set up and run Hadoop clusters.
Complete with case studies that illustrate how Hadoop solves specific problems, this book helps you:
- Use the Hadoop Distributed File System (HDFS) for storing large datasets, and run distributed computations over those datasets using MapReduce
- Become familiar with Hadoop's data and I/O building blocks for compression, data integrity, serialization, and persistence
- Discover common pitfalls and advanced features for writing real-world MapReduce programs
- Design, build, and administer a dedicated Hadoop cluster, or run Hadoop in the cloud
- Use Pig, a high-level query language for large-scale data processing
- Take advantage of HBase, Hadoop's database for structured and semi-structured data
- Learn ZooKeeper, a toolkit of coordination primitives for building distributed systems
If you have lots of data -- whether it's gigabytes or petabytes -- Hadoop is the perfect solution. Hadoop: The Definitive Guide is the most thorough book available on the subject.
"Now you have the opportunity to learn about Hadoop from a master -- not only of the technology, but also of common sense and plain talk." -- Doug Cutting, Hadoop Founder, Yahoo!
The growing popularity of Apache Cassandra rests on this database's ability to handle very large data sets that include hundreds of terabytes. This hands-on guide provides the details and practical examples developers need to understand Cassandra's non-relational database design and how to take advantage of it in a production environment.
What could you do with data if scalability wasn't a problem? With this hands-on guide, you'll learn how Apache Cassandra handles hundreds of terabytes of data while remaining highly available across multiple data centers -- capabilities that have attracted Facebook, Twitter, and other data-intensive companies. Cassandra: The Definitive Guide provides the technical details and practical examples you need to assess this database management system and put it to work in a production environment.
Author Eben Hewitt demonstrates the advantages of Cassandra's nonrelational design, and pays special attention to data modeling. If you're a developer, DBA, application architect, or manager looking to solve a database scaling issue or future-proof your application, this guide shows you how to harness Cassandra's speed and flexibility.
- Understand the tenets of Cassandra's column-oriented structure
- Learn how to write, update, and read Cassandra data
- Discover how to add or remove nodes from the cluster as your application requires
- Examine a working application that translates from a relational model to Cassandra's data model
- Use examples for writing clients in Java, Python, and C#
- Use the JMX interface to monitor a cluster's usage, memory patterns, and more
- Tune memory settings, data storage, and caching for better performance
The growing popularity of Apache Cassandra rests on this database's ability to handle very large data sets that include hundreds of terabytes. This hands-on guide provides the details and practical examples you need to understand Cassandra's non-relational database design and how to take advantage of it in a production environment.
Author Eben Hewitt (Java SOA Cookbook) pays special attention to data modeling, and demonstrates Cassandra's many advantages, including its high availability, eventual consistency model, and ability to scale easily. If you're a developer with a startup, you'll learn how to future-proof your application by implementing Cassandra before your storage needs become critical. Join Twitter, Cisco, Digg, Reddit, and other data-intensive organizations that have come to rely on Cassandra's NoSQL design. This book shows you how.
- Understand the tenets of NoSQL and Cassandra's column-oriented structure
- Get best practices for configuring, monitoring, and performance tuning
- Learn how to write, update, and read Cassandra data
- Discover how Cassandra's distributed design lets you add or remove nodes from the cluster as your application requires
- Get examples for writing clients in Java, Python, C#, and Scala
- Extend Cassandra by integrating it with the Hadoop framework
Organizations large and small are adopting Apache Hadoop to deal with huge application data sets, and this comprehensive resource provides the key for unlocking the wealth this data holds.
About the Author
Tom White has been an Apache Hadoop committer since February 2007, and is a member of the Apache Software Foundation. He works for Cloudera, a company set up to offer Hadoop support and training. Previously he was an independent Hadoop consultant, working with companies to set up, use, and extend Hadoop. He has written numerous articles for O'Reilly, java.net, and IBM's developerWorks, and has spoken at several conferences, including ApacheCon 2008 on Hadoop. Tom has a Bachelor's degree in Mathematics from the University of Cambridge and a Master's in Philosophy of Science from the University of Leeds, UK.
Table of Contents
Foreword; Preface; Why Apache Cassandra?; Is This Book for You?; What's in This Book?; Finding Out More; Conventions Used in This Book; Using Code Examples; Safari® Enabled; How to Contact Us; Acknowledgments;
Chapter 1: Introducing Cassandra; 1.1 What's Wrong with Relational Databases?; 1.2 A Quick Review of Relational Databases; 1.3 The Cassandra Elevator Pitch; 1.4 Where Did Cassandra Come From?; 1.5 Use Cases for Cassandra; 1.6 Who Is Using Cassandra?; 1.7 Summary;
Chapter 2: Installing Cassandra; 2.1 Installing the Binary; 2.2 Building from Source; 2.3 Running Cassandra; 2.4 Running the Command-Line Client Interface; 2.5 Basic CLI Commands; 2.6 Summary;
Chapter 3: The Cassandra Data Model; 3.1 The Relational Data Model; 3.2 A Simple Introduction; 3.3 Clusters; 3.4 Keyspaces; 3.5 Column Families; 3.6 Columns; 3.7 Super Columns; 3.8 Design Differences Between RDBMS and Cassandra; 3.9 Design Patterns; 3.10 Some Things to Keep in Mind; 3.11 Summary;
Chapter 4: Sample Application; 4.1 Data Design; 4.2 Hotel App RDBMS Design; 4.3 Hotel App Cassandra Design; 4.4 Hotel Application Code; 4.5 Twissandra; 4.6 Summary;
Chapter 5: The Cassandra Architecture; 5.1 System Keyspace; 5.2 Peer-to-Peer; 5.3 Gossip and Failure Detection; 5.4 Anti-Entropy and Read Repair; 5.5 Memtables, SSTables, and Commit Logs; 5.6 Hinted Handoff; 5.7 Compaction; 5.8 Bloom Filters; 5.9 Tombstones; 5.10 Staged Event-Driven Architecture (SEDA); 5.11 Managers and Services; 5.12 Summary;
Chapter 6: Configuring Cassandra; 6.1 Keyspaces; 6.2 Replicas; 6.3 Replica Placement Strategies; 6.4 Replication Factor; 6.5 Partitioners; 6.6 Snitches; 6.7 Creating a Cluster; 6.8 Dynamic Ring Participation; 6.9 Security; 6.10 Miscellaneous Settings; 6.11 Additional Tools; 6.12 Summary;
Chapter 7: Reading and Writing Data; 7.1 Query Differences Between RDBMS and Cassandra; 7.2 Basic Write Properties; 7.3 Consistency Levels; 7.4 Basic Read Properties; 7.5 The API; 7.6 Setup and Inserting Data; 7.7 Using a Simple Get; 7.8 Seeding Some Values; 7.9 Slice Predicate; 7.10 Get Range Slices; 7.11 Multiget Slice; 7.12 Deleting; 7.13 Batch Mutates; 7.14 Programmatically Defining Keyspaces and Column Families; 7.15 Summary;
Chapter 8: Clients; 8.1 Basic Client API; 8.2 Thrift; 8.3 Avro; 8.4 A Bit of Git; 8.5 Connecting Client Nodes; 8.6 Cassandra Web Console; 8.7 Hector (Java); 8.8 HectorSharp (C#); 8.9 Chirper; 8.10 Chiton (Python); 8.11 Pelops (Java); 8.12 Kundera (Java ORM); 8.13 Fauna (Ruby); 8.14 Summary;
Chapter 9: Monitoring; 9.1 Logging; 9.2 Overview of JMX and MBeans; 9.3 Interacting with Cassandra via JMX; 9.4 Cassandra's MBeans; 9.5 Custom Cassandra MBeans; 9.6 Runtime Analysis Tools; 9.7 Health Check; 9.8 Summary;
Chapter 10: Maintenance; 10.1 Getting Ring Information; 10.2 Getting Statistics; 10.3 Basic Maintenance; 10.4 Snapshots; 10.5 Load-Balancing the Cluster; 10.6 Decommissioning a Node; 10.7 Updating Nodes; 10.8 Summary;
Chapter 11: Performance Tuning; 11.1 Data Storage; 11.2 Reply Timeout; 11.3 Commit Logs; 11.4 Memtables; 11.5 Concurrency; 11.6 Caching; 11.7 Buffer Sizes; 11.8 Using the Python Stress Test; 11.9 Startup and JVM Settings; 11.10 Summary;
Chapter 12: Integrating Hadoop; 12.1 What Is Hadoop?; 12.2 Working with MapReduce; 12.3 Running the Word Count Example; 12.4 Tools Above MapReduce; 12.5 Cluster Configuration; 12.6 Use Cases; 12.7 Summary;
The Nonrelational Landscape; Nonrelational Databases; Object Databases; XML Databases; Document-Oriented Databases; Graph Databases; Key-Value Stores and Distributed Hashtables; Columnar Databases; Summary;
Glossary; Colophon;