Synopses & Reviews
Learn web scraping and crawling techniques to access unlimited data from any web source in any format. With this practical guide, you'll learn how to use Python scripts and web APIs to gather and process data from thousands, or even millions, of web pages at once.
Ideal for programmers, security professionals, and web administrators familiar with Python, this book not only teaches basic web scraping mechanics, but also delves into more advanced topics, such as analyzing raw data or using scrapers for frontend website testing. Code samples are available to help you understand the concepts in practice.
- Learn how to parse complicated HTML pages
- Traverse multiple pages and sites
- Get a general overview of APIs and how they work
- Learn several methods for storing the data you scrape
- Download, read, and extract data from documents
- Use tools and techniques to clean badly formatted data
- Read and write natural languages
- Crawl through forms and logins
- Understand how to scrape JavaScript
- Learn image processing and text recognition
Synopsis
Want to freely access unlimited data from any web source, in any format? Automated gathering and manipulation of data from across the web helped launch Facebook in its early days, and is the foundation of Google's search engine today. With this book, you'll learn how to gather unlimited data from any web source and use it for your own studies or web applications.
Web scraping is a technology nearly as old as the web itself, but the techniques used must keep pace with web technologies in order to remain viable. Web Scraping with Python not only teaches you the basics of web scraping, but also gets you up to speed on cutting-edge security and technology considerations in one comprehensive guide.
- Learn what web scraping is and why it's useful
- Understand the legalities of web scraping
- Create basic scrapers and more complicated crawlers
- Apply advanced HTML parsing with JSoup/BeautifulSoup
- Use scrapers to test your own site
- Navigate security challenges and tricky sites
About the Author
Ryan Mitchell is a Software Engineer at LinkeDrive in Boston, where she develops their API and data analysis tools. She is a graduate of Olin College of Engineering and a master's degree student at Harvard University School of Extension Studies. Prior to joining LinkeDrive, she was a Software Engineer working on web scraping and data analysis at Abine.