Synopses & Reviews
Gain hands-on experience with HDF5 for storing scientific data in Python. This practical guide quickly gets you up to speed on the details, best practices, and pitfalls of using HDF5 to archive and share numerical datasets ranging in size from gigabytes to terabytes.
Through real-world examples and practical exercises, you'll explore topics such as scientific datasets, hierarchically organized groups, user-defined metadata, and interoperable files. Examples are applicable for users of both Python 2 and Python 3. If you're familiar with the basics of Python data analysis, this is an ideal introduction to HDF5.
- Get set up with HDF5 tools and create your first HDF5 file
- Work with datasets by learning the HDF5 Dataset object
- Understand advanced features like dataset chunking and compression
- Learn how to work with HDF5's hierarchical structure, using groups
- Create self-describing files by adding metadata with HDF5 attributes
- Take advantage of HDF5's type system to create interoperable files
- Express relationships among data with references, named types, and dimension scales
- Discover how Python mechanisms for writing parallel code interact with HDF5
For anyone with a basic background in Python data analysis and NumPy, Collette explains how to use the HDF5 (Hierarchical Data Format version five) data storage and communication system from Python. He emphasizes the native HDF5 feature set, rather than higher-level abstractions on the Python side, in order to make the book as useful as possible for creating portable files. The material supports both Python 2 and Python 3, he says, and while the examples are written for Python 2, he notes differences for Python 3 when they are relevant. This is one of O'Reilly's Nutshell Handbooks. Annotation ©2014 Ringgold, Inc., Portland, OR (protoview.com)
With the rise of the Python-NumPy stack for analysis, one area that remains under-documented is storage for large scientific datasets. When this topic is discussed, it is usually within the context of the native data-archiving features in specific Python packages such as pandas. While such packages may use open formats on the back end, no in-depth work currently exists covering the nuts and bolts, best practices, and pitfalls of dealing with gigabyte-to-terabyte-sized datasets from Python.
This book aims to fill that gap by providing practical coverage of the use of HDF5 to archive and share binary data in Python.
About the Author
Andrew Collette holds a Ph.D. in physics from UCLA and works as a laboratory research scientist at the University of Colorado. He has worked with the Python-NumPy-HDF5 stack at two multimillion-dollar research facilities: the Large Plasma Device at UCLA (entirely standardized on HDF5) and the hypervelocity dust accelerator at the Colorado Center for Lunar Dust and Atmospheric Studies, University of Colorado at Boulder. Additionally, Dr. Collette is a leading developer of the HDF5 for Python (h5py) project.
Table of Contents
Preface; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Chapter 1: Introduction; 1.1 Python and HDF5; 1.2 What Exactly Is HDF5?
Chapter 2: Getting Started; 2.1 HDF5 Basics; 2.2 Setting Up; 2.3 The HDF5 Tools; 2.4 Your First HDF5 File
Chapter 3: Working with Datasets; 3.1 Dataset Basics; 3.2 Reading and Writing Data; 3.3 Resizing Datasets
Chapter 4: How Chunking and Compression Can Help You; 4.1 Contiguous Storage; 4.2 Chunked Storage; 4.3 Setting the Chunk Shape; 4.4 Performance Example: Resizable Datasets; 4.5 Filters and Compression; 4.6 Other Filters; 4.7 Third-Party Filters
Chapter 5: Groups, Links, and Iteration: The "H" in HDF5; 5.1 The Root Group and Subgroups; 5.2 Group Basics; 5.3 Working with Links; 5.4 Iteration and Containership; 5.5 Multilevel Iteration with the Visitor Pattern; 5.6 Copying Objects; 5.7 Object Comparison and Hashing
Chapter 6: Storing Metadata with Attributes; 6.1 Attribute Basics; 6.2 Real-World Example: Accelerator Particle Database
Chapter 7: More About Types; 7.1 The HDF5 Type System; 7.2 Integers and Floats; 7.3 Fixed-Length Strings; 7.4 Variable-Length Strings; 7.5 Compound Types; 7.6 Complex Numbers; 7.7 Enumerated Types; 7.8 Booleans; 7.9 The array Type; 7.10 Opaque Types; 7.11 Dates and Times
Chapter 8: Organizing Data with References, Types, and Dimension Scales; 8.1 Object References; 8.2 Region References; 8.3 Named Types; 8.4 Dimension Scales
Chapter 9: Concurrency: Parallel HDF5, Threading, and Multiprocessing; 9.1 Python Parallel Basics; 9.2 Threading; 9.3 Multiprocessing; 9.4 MPI and Parallel HDF5
Chapter 10: Next Steps; 10.1 Asking for Help; 10.2 Contributing
Index
Colophon