Synopses & Reviews
The Internet is bigger and better than what a mere browser allows. Webbots, Spiders, and Screen Scrapers is for programmers and businesspeople who want to take full advantage of the vast resources available on the Web. There's no reason to let browsers limit your online experience, especially when you can easily automate online tasks to suit your individual needs.
Learn how to write webbots and spiders that do all this and more:
- Programmatically download entire websites
- Effectively parse data from web pages
- Manage cookies
- Decode encrypted files
- Automate form submissions
- Send and receive email
- Send SMS alerts to your cell phone
- Unlock password-protected websites
- Automatically bid in online auctions
- Exchange data with FTP and NNTP servers
Sample projects using standard code libraries reinforce these new skills. You'll learn how to create your own webbots and spiders that track online prices, aggregate different data sources into a single web page, and archive the online data you just can't live without. You'll learn inside information from an experienced webbot developer on how and when to write stealthy webbots that mimic human behavior, tips for developing fault-tolerant designs, and various methods for launching and scheduling webbots. You'll also get advice on how to write webbots and spiders that respect website owner property rights, plus techniques for shielding websites from unwanted robots.
This second edition of Webbots, Spiders, and Screen Scrapers has been completely updated and revised to cover the latest trends in web crawling, including new chapters on text parsing, browser macros, anonymizers, and more.
Some tasks are just too tedious (or too important!) to leave to humans. Once you've automated your online life, you'll never let a browser limit the way you use the Internet again.
The book first outlines the deficiencies of browsers, and then explains how these deficiencies can be exploited in the design and deployment of task-specific webbots.
As they follow along, readers learn how to write stealthy webbots that send and receive email and text messages, manage cookies, and decode encrypted files. Sample projects reinforce these new skills so that readers can create more sophisticated bots to track online prices, download entire websites, and bid on auctions in their closing moments.
There's a wealth of data online, but sorting and gathering it by hand can be tedious and time-consuming. Rather than click through page after endless page, why not let bots do the work for you?
Webbots, Spiders, and Screen Scrapers will show you how to create simple programs with PHP/CURL to mine, parse, and archive online data to help you make informed decisions. Michael Schrenk, a highly regarded webbot developer, teaches you how to develop fault-tolerant designs, how best to launch and schedule the work of your bots, and how to create Internet agents that:
- Send email or SMS notifications to alert you to new information quickly
- Search different data sources and combine the results on one page, making the data easier to interpret and analyze
- Automate purchases, auction bids, and other online activities to save time
Sample projects for automating tasks like price monitoring and news aggregation will show you how to put the concepts you learn into practice.
This second edition of Webbots, Spiders, and Screen Scrapers includes tricks for dealing with sites that are resistant to crawling and scraping, writing stealthy webbots that mimic human search behavior, and using regular expressions to harvest specific data. As you discover the possibilities of web scraping, you'll see how webbots can save you precious time and give you much greater control over the data available on the Web.
About the Author
Michael Schrenk uses webbots and data-driven web applications to create competitive advantages for businesses. He has written for Computerworld and Web Techniques magazines and has taught courses on Web usability and Internet marketing. He has also given presentations on intelligent Web agents and online corporate intelligence at the DEFCON hacker's convention.
Table of Contents
About the Author
About the Technical Reviewer
Acknowledgments
Introduction
Old-School Client-Server Technology; The Problem with Browsers; What to Expect from This Book; About the Website; About the Code; Requirements; A Disclaimer (This Is Important)

Fundamental Concepts and Techniques
Chapter 1: What's in It for You?
1.1 Uncovering the Internet's True Potential; 1.2 What's in It for Developers?; 1.3 What's in It for Business Leaders?; 1.4 Final Thoughts
Chapter 2: Ideas for Webbot Projects
2.1 Inspiration from Browser Limitations; 2.2 A Few Crazy Ideas to Get You Started; 2.3 Final Thoughts
Chapter 3: Downloading Web Pages
3.1 Think About Files, Not Web Pages; 3.2 Downloading Files with PHP's Built-in Functions; 3.3 Introducing PHP/CURL; 3.4 Installing PHP/CURL; 3.5 LIB_http; 3.6 Final Thoughts
Chapter 4: Basic Parsing Techniques
4.1 Content Is Mixed with Markup; 4.2 Parsing Poorly Written HTML; 4.3 Standard Parse Routines; 4.4 Using LIB_parse; 4.5 Useful PHP Functions; 4.6 Final Thoughts
Chapter 5: Advanced Parsing with Regular Expressions
5.1 Pattern Matching, the Key to Regular Expressions; 5.2 PHP Regular Expression Types; 5.3 Learning Patterns Through Examples; 5.4 Regular Expressions of Particular Interest to Webbot Developers; 5.5 When Regular Expressions Are (or Aren't) the Right Parsing Tool; 5.6 Final Thoughts
Chapter 6: Automating Form Submission
6.1 Reverse Engineering Form Interfaces; 6.2 Form Handlers, Data Fields, Methods, and Event Triggers; 6.3 Unpredictable Forms; 6.4 Analyzing a Form; 6.5 Final Thoughts
Chapter 7: Managing Large Amounts of Data
7.1 Organizing Data; 7.2 Making Data Smaller; 7.3 Thumbnailing Images; 7.4 Final Thoughts

Projects
Chapter 8: Price-Monitoring Webbots
8.1 The Target; 8.2 Designing the Parsing Script; 8.3 Initialization and Downloading the Target; 8.4 Further Exploration
Chapter 9: Image-Capturing Webbots
9.1 Example Image-Capturing Webbot; 9.2 Creating the Image-Capturing Webbot; 9.3 Further Exploration; 9.4 Final Thoughts
Chapter 10: Link-Verification Webbots
10.1 Creating the Link-Verification Webbot; 10.2 Running the Webbot; 10.3 Further Exploration
Chapter 11: Search-Ranking Webbots
11.1 Description of a Search Result Page; 11.2 What the Search-Ranking Webbot Does; 11.3 Running the Search-Ranking Webbot; 11.4 How the Search-Ranking Webbot Works; 11.5 The Search-Ranking Webbot Script; 11.6 Final Thoughts; 11.7 Further Exploration
Chapter 12: Aggregation Webbots
12.1 Choosing Data Sources for Webbots; 12.2 Example Aggregation Webbot; 12.3 Adding Filtering to Your Aggregation Webbot; 12.4 Further Exploration
Chapter 13: FTP Webbots
13.1 Example FTP Webbot; 13.2 PHP and FTP; 13.3 Further Exploration
Chapter 14: Webbots That Read Email
14.1 The POP3 Protocol; 14.2 Executing POP3 Commands with a Webbot; 14.3 Further Exploration
Chapter 15: Webbots That Send Email
15.1 Email, Webbots, and Spam; 15.2 Sending Mail with SMTP and PHP; 15.3 Writing a Webbot That Sends Email Notifications; 15.4 Further Exploration
Chapter 16: Converting a Website into a Function
16.1 Writing a Function Interface; 16.2 Final Thoughts

Advanced Technical Considerations
Chapter 17: Spiders
17.1 How Spiders Work; 17.2 Example Spider; 17.3 LIB_simple_spider; 17.4 Experimenting with the Spider; 17.5 Adding the Payload; 17.6 Further Exploration
Chapter 18: Procurement Webbots and Snipers
18.1 Procurement Webbot Theory; 18.2 Sniper Theory; 18.3 Testing Your Own Webbots and Snipers; 18.4 Further Exploration; 18.5 Final Thoughts
Chapter 19: Webbots and Cryptography
19.1 Designing Webbots That Use Encryption; 19.2 A Quick Overview of Web Encryption; 19.3 Final Thoughts
Chapter 20: Authentication
20.1 What Is Authentication?; 20.2 Example Scripts and Practice Pages; 20.3 Basic Authentication; 20.4 Session Authentication; 20.5 Final Thoughts
Chapter 21: Advanced Cookie Management
21.1 How Cookies Work; 21.2 PHP/CURL and Cookies; 21.3 How Cookies Challenge Webbot Design; 21.4 Further Exploration
Chapter 22: Scheduling Webbots and Spiders
22.1 Preparing Your Webbots to Run as Scheduled Tasks; 22.2 The Windows XP Task Scheduler; 22.3 The Windows 7 Task Scheduler; 22.4 Non-calendar-based Triggers; 22.5 Final Thoughts
Chapter 23: Scraping Difficult Websites with Browser Macros
23.1 Barriers to Effective Web Scraping; 23.2 Overcoming Webscraping Barriers with Browser Macros; 23.3 Final Thoughts
Chapter 24: Hacking iMacros
24.1 Hacking iMacros for Added Functionality; 24.2 Further Exploration
Chapter 25: Deployment and Scaling
25.1 One-to-Many Environment; 25.2 One-to-One Environment; 25.3 Many-to-Many Environment; 25.4 Many-to-One Environment; 25.5 Scaling and Denial-of-Service Attacks; 25.6 Creating Multiple Instances of a Webbot; 25.7 Managing a Botnet; 25.8 Further Exploration

Larger Considerations
Chapter 26: Designing Stealthy Webbots and Spiders
26.1 Why Design a Stealthy Webbot?; 26.2 Stealth Means Simulating Human Patterns; 26.3 Final Thoughts
Chapter 27: Proxies
27.1 What Is a Proxy?; 27.2 Proxies in the Virtual World; 27.3 Why Webbot Developers Use Proxies; 27.4 Using a Proxy Server; 27.5 Types of Proxy Servers; 27.6 Final Thoughts
Chapter 28: Writing Fault-Tolerant Webbots
28.1 Types of Webbot Fault Tolerance; 28.2 Error Handlers; 28.3 Further Exploration
Chapter 29: Designing Webbot-Friendly Websites
29.1 Optimizing Web Pages for Search Engine Spiders; 29.2 Web Design Techniques That Hinder Search Engine Spiders; 29.3 Designing Data-Only Interfaces; 29.4 Final Thoughts
Chapter 30: Killing Spiders
30.1 Asking Nicely; 30.2 Building Speed Bumps; 30.3 Setting Traps; 30.4 Final Thoughts
Chapter 31: Keeping Webbots out of Trouble
31.1 It's All About Respect; 31.2 Copyright; 31.3 Trespass to Chattels; 31.4 Internet Law; 31.5 Final Thoughts

PHP/CURL Reference
Creating a Minimal PHP/CURL Session; Initiating PHP/CURL Sessions; Setting PHP/CURL Options; Executing the PHP/CURL Command; Closing PHP/CURL Sessions
Status Codes
HTTP Codes; NNTP Codes
SMS Gateways
Sending Text Messages; Reading Text Messages; A Sampling of Text Message Email Addresses