Synopses & Reviews
Create your own natural language training corpus for machine learning. Whether you're working with English, Chinese, or any other natural language, this hands-on book guides you through a proven annotation development cycle: the process of adding metadata to your training corpus to help ML algorithms work more efficiently. You don't need any programming or linguistics experience to get started.
Using detailed examples at every step, you'll learn how the MATTER Annotation Development Process helps you Model, Annotate, Train, Test, Evaluate, and Revise your training corpus. You also get a complete walkthrough of a real-world annotation project.
- Define a clear annotation goal before collecting your dataset (corpus)
- Learn tools for analyzing the linguistic content of your corpus
- Build a model and specification for your annotation project
- Examine the different annotation formats, from basic XML to the Linguistic Annotation Framework
- Create a gold standard corpus that can be used to train and test ML algorithms
- Select the ML algorithms that will process your annotated data
- Evaluate the test results and revise your annotation task
- Learn how to use lightweight software for annotating texts and adjudicating the annotations
This book is a perfect companion to O'Reilly's Natural Language Processing with Python.
Synopsis
Create your own natural language training corpus for machine learning. This example-driven book walks you through the annotation cycle, from selecting an annotation task and creating the annotation specification to designing the guidelines, creating a "gold standard" corpus, and then beginning the actual data creation with the annotation process.
Systems exist for analyzing existing corpora, but making a new corpus can be extremely complex. To help you build a foundation for your own machine learning goals, this easy-to-use guide includes case studies that demonstrate four different annotation tasks in detail. You'll also learn how to use a lightweight software package for annotating texts and adjudicating the annotations.
This book is a perfect companion to O'Reilly's Natural Language Processing with Python, which describes how to use existing corpora with the Natural Language Toolkit.
About the Author
James Pustejovsky teaches and does research in Artificial Intelligence and Computational Linguistics in the Computer Science Department at Brandeis University. His main areas of interest include lexical meaning, computational semantics, temporal and spatial reasoning, and corpus linguistics. He is active in the development of standards for interoperability between language processing applications, and led the creation of the recently adopted ISO standard for time annotation, ISO-TimeML. He is currently heading the development of a standard for annotating spatial information in language. More information on publications and research activities can be found at his webpage: pusto.com.
Table of Contents
Preface; Natural Language Annotation for Machine Learning; Audience; Organization of This Book; Software Requirements; Conventions Used in This Book; Using Code Examples; Safari® Books Online; How to Contact Us; Acknowledgments;
Chapter 1: The Basics; 1.1 The Importance of Language Annotation; 1.2 A Brief History of Corpus Linguistics; 1.3 Language Data and Machine Learning; 1.4 The Annotation Development Cycle; 1.5 Summary;
Chapter 2: Defining Your Goal and Dataset; 2.1 Defining Your Goal; 2.2 Background Research; 2.3 Assembling Your Dataset; 2.4 The Size of Your Corpus; 2.5 Summary;
Chapter 3: Corpus Analytics; 3.1 Basic Probability for Corpus Analytics; 3.2 Counting Occurrences; 3.3 Language Models; 3.4 Summary;
Chapter 4: Building Your Model and Specification; 4.1 Some Example Models and Specs; 4.2 Adopting (or Not Adopting) Existing Models; 4.3 Different Kinds of Standards; 4.4 Summary;
Chapter 5: Applying and Adopting Annotation Standards; 5.1 Metadata Annotation: Document Classification; 5.2 Text Extent Annotation: Named Entities; 5.3 Linked Extent Annotation: Semantic Roles; 5.4 ISO Standards and You; 5.5 Summary;
Chapter 6: Annotation and Adjudication; 6.1 The Infrastructure of an Annotation Project; 6.2 Specification Versus Guidelines; 6.3 Be Prepared to Revise; 6.4 Preparing Your Data for Annotation; 6.5 Writing the Annotation Guidelines; 6.6 Annotators; 6.7 Choosing an Annotation Environment; 6.8 Evaluating the Annotations; 6.9 Creating the Gold Standard (Adjudication); 6.10 Summary;
Chapter 7: Training: Machine Learning; 7.1 What Is Learning?; 7.2 Defining Our Learning Task; 7.3 Classifier Algorithms; 7.4 Sequence Induction Algorithms; 7.5 Clustering and Unsupervised Learning; 7.6 Semi-Supervised Learning; 7.7 Matching Annotation to Algorithms; 7.8 Summary;
Chapter 8: Testing and Evaluation; 8.1 Testing Your Algorithm; 8.2 Evaluating Your Algorithm; 8.3 Problems That Can Affect Evaluation; 8.4 Final Testing Scores; 8.5 Summary;
Chapter 9: Revising and Reporting; 9.1 Revising Your Project; 9.2 Reporting About Your Work; 9.3 Summary;
Chapter 10: Annotation: TimeML; 10.1 The Goal of TimeML; 10.2 Related Research; 10.3 Building the Corpus; 10.4 Model: Preliminary Specifications; 10.5 Annotation: First Attempts; 10.6 Model: The TimeML Specification Used in TimeBank; 10.7 Annotation: The Creation of TimeBank; 10.8 TimeML Becomes ISO-TimeML; 10.9 Modeling the Future: Directions for TimeML; 10.10 Summary;
Chapter 11: Automatic Annotation: Generating TimeML; 11.1 The TARSQI Components; 11.2 Improvements to the TTK; 11.3 TimeML Challenges: TempEval-2; 11.4 Future of the TTK; 11.5 Summary;
Chapter 12: Afterword: The Future of Annotation; 12.1 Crowdsourcing Annotation; 12.2 Handling Big Data; 12.3 NLP Online and in the Cloud; 12.4 And Finally...;
List of Available Corpora and Specifications; Corpora; Specifications, Guidelines, and Other Resources; Representation Standards;
List of Software Resources; Annotation and Adjudication Software; Machine Learning Resources;
MAE User Guide; Installing and Running MAE; Loading Tasks and Files; Saving Files; Defining Your Own Task; Frequently Asked Questions;
MAI User Guide; Installing and Running MAI; Loading Tasks and Files; Adjudicating; Saving Files;
Bibliography; References for Using Amazon's Mechanical Turk/Crowdsourcing;
Colophon;