Natural Language Annotation for Machine Learning

by James Pustejovsky, Amber Stubbs

ISBN13: 9781449306663
ISBN10: 1449306667

$45.50

New Trade Paperback

Available at a Remote Warehouse. Ships separately from other items. Additional shipping charges may apply. Not available for In Store Pickup.

Qty	Store
20	Remote Warehouse

Synopses & Reviews

Publisher Comments

Create your own natural language training corpus for machine learning. Whether you're working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle: the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don't need any programming or linguistics experience to get started.

Using detailed examples at every step, you'll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.

Define a clear annotation goal before collecting your dataset (corpus)
Learn tools for analyzing the linguistic content of your corpus
Build a model and specification for your annotation project
Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
Create a gold standard corpus that can be used to train and test ML algorithms
Select the ML algorithms that will process your annotated data
Evaluate the test results and revise your annotation task
Learn how to use lightweight software for annotating texts and adjudicating the annotations

This book is a perfect companion to O'Reilly's Natural Language Processing with Python.

Synopsis

Create your own natural language training corpus for machine learning. This example-driven book walks you through the annotation cycle, from selecting an annotation task and creating the annotation specification to designing the guidelines, creating a "gold standard" corpus, and then beginning the actual data creation with the annotation process.

Systems exist for analyzing existing corpora, but making a new corpus can be extremely complex. To help you build a foundation for your own machine learning goals, this easy-to-use guide includes case studies that demonstrate four different annotation tasks in detail. You'll also learn how to use a lightweight software package for annotating texts and adjudicating the annotations.

This book is a perfect companion to O'Reilly's Natural Language Processing with Python, which describes how to use existing corpora with the Natural Language Toolkit.

About the Author

James Pustejovsky teaches and does research in Artificial Intelligence and Computational Linguistics in the Computer Science Department at Brandeis University. His main areas of interest include lexical meaning, computational semantics, temporal and spatial reasoning, and corpus linguistics. He is active in the development of standards for interoperability between language processing applications, and led the creation of the recently adopted ISO standard for time annotation, ISO-TimeML. He is currently heading the development of a standard for annotating spatial information in language. More information on publications and research activities can be found at his webpage: pusto.com.

Table of Contents

Preface; Natural Language Annotation for Machine Learning; Audience; Organization of This Book; Software Requirements; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments
Chapter 1: The Basics; 1.1 The Importance of Language Annotation; 1.2 A Brief History of Corpus Linguistics; 1.3 Language Data and Machine Learning; 1.4 The Annotation Development Cycle; 1.5 Summary
Chapter 2: Defining Your Goal and Dataset; 2.1 Defining Your Goal; 2.2 Background Research; 2.3 Assembling Your Dataset; 2.4 The Size of Your Corpus; 2.5 Summary
Chapter 3: Corpus Analytics; 3.1 Basic Probability for Corpus Analytics; 3.2 Counting Occurrences; 3.3 Language Models; 3.4 Summary
Chapter 4: Building Your Model and Specification; 4.1 Some Example Models and Specs; 4.2 Adopting (or Not Adopting) Existing Models; 4.3 Different Kinds of Standards; 4.4 Summary
Chapter 5: Applying and Adopting Annotation Standards; 5.1 Metadata Annotation: Document Classification; 5.2 Text Extent Annotation: Named Entities; 5.3 Linked Extent Annotation: Semantic Roles; 5.4 ISO Standards and You; 5.5 Summary
Chapter 6: Annotation and Adjudication; 6.1 The Infrastructure of an Annotation Project; 6.2 Specification Versus Guidelines; 6.3 Be Prepared to Revise; 6.4 Preparing Your Data for Annotation; 6.5 Writing the Annotation Guidelines; 6.6 Annotators; 6.7 Choosing an Annotation Environment; 6.8 Evaluating the Annotations; 6.9 Creating the Gold Standard (Adjudication); 6.10 Summary
Chapter 7: Training: Machine Learning; 7.1 What Is Learning?; 7.2 Defining Our Learning Task; 7.3 Classifier Algorithms; 7.4 Sequence Induction Algorithms; 7.5 Clustering and Unsupervised Learning; 7.6 Semi-Supervised Learning; 7.7 Matching Annotation to Algorithms; 7.8 Summary
Chapter 8: Testing and Evaluation; 8.1 Testing Your Algorithm; 8.2 Evaluating Your Algorithm; 8.3 Problems That Can Affect Evaluation; 8.4 Final Testing Scores; 8.5 Summary
Chapter 9: Revising and Reporting; 9.1 Revising Your Project; 9.2 Reporting About Your Work; 9.3 Summary
Chapter 10: Annotation: TimeML; 10.1 The Goal of TimeML; 10.2 Related Research; 10.3 Building the Corpus; 10.4 Model: Preliminary Specifications; 10.5 Annotation: First Attempts; 10.6 Model: The TimeML Specification Used in TimeBank; 10.7 Annotation: The Creation of TimeBank; 10.8 TimeML Becomes ISO-TimeML; 10.9 Modeling the Future: Directions for TimeML; 10.10 Summary
Chapter 11: Automatic Annotation: Generating TimeML; 11.1 The TARSQI Components; 11.2 Improvements to the TTK; 11.3 TimeML Challenges: TempEval-2; 11.4 Future of the TTK; 11.5 Summary
Chapter 12: Afterword: The Future of Annotation; 12.1 Crowdsourcing Annotation; 12.2 Handling Big Data; 12.3 NLP Online and in the Cloud; 12.4 And Finally...
List of Available Corpora and Specifications; Corpora; Specifications, Guidelines, and Other Resources; Representation Standards
List of Software Resources; Annotation and Adjudication Software; Machine Learning Resources
MAE User Guide; Installing and Running MAE; Loading Tasks and Files; Saving Files; Defining Your Own Task; Frequently Asked Questions
MAI User Guide; Installing and Running MAI; Loading Tasks and Files; Adjudicating; Saving Files
Bibliography; References for Using Amazon's Mechanical Turk/Crowdsourcing
Colophon

Product Details

ISBN: 9781449306663
Binding: Trade Paperback
Publication date: 12/04/2012
Publisher: O'Reilly Media
Pages: 339
Height: 0.73 in
Width: 7.07 in
Thickness: 0.75
Illustration: Yes
Author: Amber Stubbs
Author: James Pustejovsky
Subject: Database design