Fundamentals of NLP (Chapter 1): Tokenization, Lemmatization, Stemming, and Sentence Segmentation

Natural language processing (NLP) has made substantial advances in the past few years due to the success of modern techniques that are based on deep learning. With the rise of the popularity of NLP and the availability of different forms of large-scale data, it is now even more imperative to understand the inner workings of NLP techniques and concepts, from first principles, as they find their way into real-world usage and applications that affect society at large. Building intuitions and having a solid grasp of concepts are both important for coming up with innovative techniques, improving research, and building safe, human-centered AI and NLP technologies.

In this first chapter, which is part of a series called Fundamentals of NLP, we will learn about some of the most important basic concepts that power NLP techniques used for research and building real-world applications. Some of these techniques include lemmatization, stemming, tokenization, and sentence segmentation. These are all important techniques to train efficient and effective NLP models. Along the way, we will also cover best practices and common mistakes to avoid when training and building NLP models. We also provide some exercises for you to keep practicing and exploring some ideas.

In every chapter, we will introduce the theoretical aspect and motivation of each concept covered. Then we will obtain hands-on experience by using bootstrap methods, industry-standard tools, and other open-source libraries to implement the different techniques. Along the way, we will also cover best practices, share important references, point out common mistakes to avoid when training and building NLP models, and discuss what lies ahead.

The project will be maintained on GitHub.

Get access to the first chapter at this Colab notebook.

Elvis Saravia

Fundamentals of NLP (Chapter 1): Tokenization, Lemmatization, Stemming, and Sentence Segmentation

You might also enjoy (View all posts)