Vincent Granville, Ph.D., is a data scientist with 15 years of experience in big data, predictive modeling, and business analytics. He is the co-founder of Data Science Central, which offers a robust editorial platform, social interaction, forum-based technical support, coverage of the latest technology tools and trends, and industry job opportunities.
Introduction xxi
Chapter 1 What Is Data Science? 1
Real Versus Fake Data Science 2
Two Examples of Fake Data Science 5
The Face of the New University 6
The Data Scientist 9
Data Scientist Versus Data Engineer 9
Data Scientist Versus Statistician 11
Data Scientist Versus Business Analyst 12
Data Science Applications in 13 Real-World Scenarios 13
Scenario 1: DUI Arrests Decrease After End of State Monopoly on Liquor Sales 14
Scenario 2: Data Science and Intuition 15
Scenario 3: Data Glitch Turns Data Into Gibberish 18
Scenario 4: Regression in Unusual Spaces 19
Scenario 5: Analytics Versus Seduction to Boost Sales 20
Scenario 6: About Hidden Data 22
Scenario 7: High Crime Rates Caused by Gasoline Lead. Really? 23
Scenario 8: Boeing Dreamliner Problems 23
Scenario 9: Seven Tricky Sentences for NLP 24
Scenario 10: Data Scientists Dictate What We Eat? 25
Scenario 11: Increasing Amazon.com Sales with Better Relevancy 27
Scenario 12: Detecting Fake Profiles or Likes on Facebook 29
Scenario 13: Analytics for Restaurants 30
Data Science History, Pioneers, and Modern Trends 30
Statistics Will Experience a Renaissance 31
History and Pioneers 32
Modern Trends 34
Recent Q&A Discussions 35
Summary 39
Chapter 2 Big Data Is Different 41
Two Big Data Issues 41
The Curse of Big Data 41
When Data Flows Too Fast 45
Examples of Big Data Techniques 51
Big Data Problem Epitomizing the Challenges of Data Science 51
Clustering and Taxonomy Creation for Massive Data Sets 53
Excel with 100 Million Rows 57
What MapReduce Can’t Do 60
The Problem 61
Three Solutions 61
Conclusion: When to Use MapReduce 63
Communication Issues 63
Data Science: The End of Statistics? 65
The Eight Worst Predictive Modeling Techniques 65
Marrying Computer Science, Statistics, and Domain Expertise 67
The Big Data Ecosystem 70
Summary 71
Chapter 3 Becoming a Data Scientist 73
Key Features of Data Scientists 73
Data Scientist Roles 73
Horizontal Versus Vertical Data Scientist 75
Types of Data Scientists 78
Fake Data Scientist 78
Self-Made Data Scientist 78
Amateur Data Scientist 79
Extreme Data Scientist 80
Data Scientist Demographics 82
Training for Data Science 82
University Programs 82
Corporate and Association Training Programs 86
Free Training Programs 87
Data Scientist Career Paths 89
The Independent Consultant 89
The Entrepreneur 95
Summary 107
Chapter 4 Data Science Craftsmanship, Part I 109
New Types of Metrics 110
Metrics to Optimize Digital Marketing Campaigns 111
Metrics for Fraud Detection 112
Choosing Proper Analytics Tools 113
Analytics Software 114
Visualization Tools 115
Real-Time Products 116
Programming Languages 117
Visualization 118
Producing Data Videos with R 118
More Sophisticated Videos 122
Statistical Modeling Without Models 122
What Is a Statistical Model Without Modeling? 123
How Does the Algorithm Work? 124
Source Code to Produce the Data Sets 125
Three Classes of Metrics: Centrality, Volatility, Bumpiness 125
Relationships Among Centrality, Volatility, and Bumpiness 125
Defining Bumpiness 126
Bumpiness Computation in Excel 127
Uses of Bumpiness Coefficients 128
Statistical Clustering for Big Data 129
Correlation and R-Squared for Big Data 130
A New Family of Rank Correlations 132
Asymptotic Distribution and Normalization 134
Computational Complexity 137
Computing q(n) 137
A Theoretical Solution 140
Structured Coefficient 140
Identifying the Number of Clusters 141
Methodology 142
Example 143
Internet Topology Mapping 143
Securing Communications: Data Encoding 147
Summary 149
Chapter 5 Data Science Craftsmanship, Part II 151
Data Dictionary 152
What Is a Data Dictionary? 152
Building a Data Dictionary 152
Hidden Decision Trees 153
Implementation 155
Example: Scoring Internet Traffic 156
Conclusion 158
Model-Free Confidence Intervals 158
Methodology 158
The Analyticbridge First Theorem 159
Application 160
Source Code 160
Random Numbers 161
Four Ways to Solve a Problem 163
Intuitive Approach for Business Analysts with Great Intuitive Abilities 164
Monte Carlo Simulations Approach for Software Engineers 165
Statistical Modeling Approach for Statisticians 165
Big Data Approach for Computer Scientists 165
Causation Versus Correlation 165
How Do You Detect Causes? 166
Life Cycle of Data Science Projects 168
Predictive Modeling Mistakes 171
Logistic-Related Regressions 172
Interactions Between Variables 172
First Order Approximation 172
Second Order Approximation 174
Regression with Excel 175
Experimental Design 176
Interesting Metrics 176
Segmenting the Patient Population 176
Customized Treatments 177
Analytics as a Service and APIs 178
How It Works 179
Example of Implementation 179
Source Code for Keyword Correlation API 180
Miscellaneous Topics 183
Preserving Scores When Data Sets Change 183
Optimizing Web Crawlers 184
Hash Joins 186
Simple Source Code to Simulate Clusters 186
New Synthetic Variance for Hadoop and Big Data 187
Introduction to Hadoop/MapReduce 187
Synthetic Metrics 188
Hadoop, Numerical, and Statistical Stability 189
The Abstract Concept of Variance 189
A New Big Data Theorem 191
Transformation-Invariant Metrics 192
Implementation: Communications Versus Computational Costs 193
Final Comments 193
Summary 193
Chapter 6 Data Science Application Case Studies 195
Stock Market 195
Pattern to Boost Return by 500 Percent 195
Optimizing Statistical Trading Strategies 197
Stock Trading API: Statistical Model 200
Stock Trading API: Implementation 202
Stock Market Simulations 203
Some Mathematics 205
New Trends 208
Encryption 209
Data Science Application: Steganography 209
Solid E‑Mail Encryption 212
Captcha Hack 214
Fraud Detection 216
Click Fraud 216
Continuous Click Scores Versus Binary Fraud/Non-Fraud 218
Mathematical Model and Benchmarking 219
Bias Due to Bogus Conversions 220
A Few Misconceptions 221
Statistical Challenges 221
Click Scoring to Optimize Keyword Bids 222
Automated, Fast Feature Selection with Combinatorial Optimization 224
Predictive Power of a Feature: Cross-Validation 225
Association Rules to Detect Collusion and Botnets 228
Extreme Value Theory for Pattern Detection 229
Digital Analytics 230
Online Advertising: Formula for Reach and Frequency 231
E‑Mail Marketing: Boosting Performance by 300 Percent 231
Optimize Keyword Advertising Campaigns in 7 Days 232
Automated News Feed Optimization 234
Competitive Intelligence with Bit.ly 234
Measuring Return on Twitter Hashtags 237
Improving Google Search with Three Fixes 240
Improving Relevancy Algorithms 242
Ad Rotation Problem 244
Miscellaneous 245
Better Sales Forecasts with Simpler Models 245
Better Detection of Healthcare Fraud 247
Attribution Modeling 248
Forecasting Meteorite Hits 248
Data Collection at Trailhead Parking Lots 252
Other Applications of Data Science 253
Summary 253
Chapter 7 Launching Your New Data Science Career 255
Job Interview Questions 255
Questions About Your Experience 255
Technical Questions 257
General Questions 258
Questions About Data Science Projects 260
Testing Your Own Visual and Analytic Thinking 263
Detecting Patterns with the Naked Eye 263
Identifying Aberrations 266
Misleading Time Series and Random Walks 266
From Statistician to Data Scientist 268
Data Scientists Are Also Statistical Practitioners 268
Who Should Teach Statistics to Data Scientists? 269
Hiring Issues 269
Data Scientists Work Closely with Data Architects 270
Who Should Be Involved in Strategic Thinking? 270
Two Types of Statisticians 271
Using Big Data Versus Sampling 272
Taxonomy of a Data Scientist 273
Data Science’s Most Popular Skill Mixes 273
Top Data Scientists on LinkedIn 276
400 Data Scientist Job Titles 279
Salary Surveys 281
Salary Breakdown by Skill and Location 281
Create Your Own Salary Survey 285
Summary 285
Chapter 8 Data Science Resources 287
Professional Resources 287
Data Sets 288
Books 288
Conferences and Organizations 290
Websites 291
Definitions 292
Career-Building Resources 295
Companies Employing Data Scientists 296
Sample Data Science Job Ads 297
Sample Resumes 297
Summary 298
Index 299