Synopses & Reviews
What is bad data? Some people consider it a technical phenomenon, like missing values or malformed records, but bad data includes a lot more. In this handbook, data expert Q. Ethan McCallum has gathered 19 colleagues from every corner of the data arena to reveal how they've recovered from nasty data problems.
From cranky storage to poor representation to misguided policy, there are many paths to bad data. Bottom line? Bad data is data that gets in the way. This book explains effective ways to get around it.
Among the many topics covered, you'll discover how to:
- Test drive your data to see if it's ready for analysis
- Work spreadsheet data into a usable form
- Handle encoding problems that lurk in text data
- Develop a successful web-scraping effort
- Use NLP tools to reveal the real sentiment of online reviews
- Address cloud computing issues that can impact your analysis effort
- Avoid policies that create data analysis roadblocks
- Take a systematic approach to data quality analysis
Welcome to data science's dirty secret: real-world data is messy. Data scientists must spend a good deal of time playing software developer, writing code to clean up data before they can actually do anything constructive with it.
It's a necessary evil, but you can still make the most of it. This practical book walks you through several real-world examples to demonstrate the theory and practice behind working with and cleaning up dirty data.
No one tool solves all of the problems well. Wise data scientists learn many tools and learn where each one shines. To that end, this book takes a polyglot approach: most examples will involve R and Python, but expect the occasional smattering of Groovy and sed/awk fun.
Even if you're relatively new to the data science field, you've likely encountered your share of bad data: missing values and arcane file formats are rather pedestrian matters. But those are just the beginning. Bad data is an ecosystem unto itself, one that also includes character set mismatches, data that changes behind your back, and data you don't know how to handle on your own.
In short, bad data is data that gets in the way.
In the Bad Data Handbook, Q. Ethan McCallum gathers a cast of authors to explore the wide variety of data headaches, including:
- Different forms of bad data, and how to spot it
- Techniques for wrangling bad data
- Infrastructure and policy matters that will impact your data analysis efforts
- Procedures to keep bad data from getting worse (and, perhaps, to help it get better)
About the Author
Q. Ethan McCallum is a consultant, writer, and technology enthusiast, though perhaps not in that order. His work has appeared online on The O'Reilly Network and Java.net, and also in print publications such as C/C++ Users Journal, Doctor Dobb's Journal, and Linux Magazine. In his professional roles, he helps companies make smart decisions about data and technology.
Table of Contents
About the Authors; Preface; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Chapter 1: Setting the Pace: What Is Bad Data?
Chapter 2: Is It Just Me, or Does This Data Smell Funny?; 2.1 Understand the Data Structure; 2.2 Field Validation; 2.3 Value Validation; 2.4 Physical Interpretation of Simple Statistics; 2.5 Visualization; 2.6 Keyword PPC Example; 2.7 Search Referral Example; 2.8 Recommendation Analysis; 2.9 Time Series Data; 2.10 Conclusion
Chapter 3: Data Intended for Human Consumption, Not Machine Consumption; 3.1 The Data; 3.2 The Problem: Data Formatted for Human Consumption; 3.3 The Solution: Writing Code; 3.4 Postscript; 3.5 Other Formats; 3.6 Summary
Chapter 4: Bad Data Lurking in Plain Text; 4.1 Which Plain Text Encoding?; 4.2 Guessing Text Encoding; 4.3 Normalizing Text; 4.4 Problem: Application-Specific Characters Leaking into Plain Text; 4.5 Text Processing with Python; 4.6 Exercises
Chapter 5: (Re)Organizing the Web's Data; 5.1 Can You Get That?; 5.2 General Workflow Example; 5.3 The Real Difficulties; 5.4 The Dark Side; 5.5 Conclusion
Chapter 6: Detecting Liars and the Confused in Contradictory Online Reviews; 6.1 Weotta; 6.2 Getting Reviews; 6.3 Sentiment Classification; 6.4 Polarized Language; 6.5 Corpus Creation; 6.6 Training a Classifier; 6.7 Validating the Classifier; 6.8 Designing with Data; 6.9 Lessons Learned; 6.10 Summary; 6.11 Resources
Chapter 7: Will the Bad Data Please Stand Up?; 7.1 Example 1: Defect Reduction in Manufacturing; 7.2 Example 2: Who's Calling?; 7.3 Example 3: When "Typical" Does Not Mean "Average"; 7.4 Lessons Learned; 7.5 Will This Be on the Test?
Chapter 8: Blood, Sweat, and Urine; 8.1 A Very Nerdy Body Swap Comedy; 8.2 How Chemists Make Up Numbers; 8.3 All Your Database Are Belong to Us; 8.4 Check, Please; 8.5 Live Fast, Die Young, and Leave a Good-Looking Corpse Code Repository; 8.6 Rehab for Chemists (and Other Spreadsheet Abusers); 8.7 tl;dr
Chapter 9: When Data and Reality Don't Match; 9.1 Whose Ticker Is It Anyway?; 9.2 Splits, Dividends, and Rescaling; 9.3 Bad Reality; 9.4 Conclusion
Chapter 10: Subtle Sources of Bias and Error; 10.1 Imputation Bias: General Issues; 10.2 Reporting Errors: General Issues; 10.3 Other Sources of Bias; 10.4 Conclusions; 10.5 References