Fundamentals of NLP - Chapter 1 - Tokenization, Lemmatization, Stemming, and Sentence Segmentation
The first chapter of the Fundamentals of NLP series.
Introduction
Author: Elvis Saravia (Twitter | LinkedIn)
The full project will be maintained here.
Natural language processing (NLP) has made substantial advances in the past few years due to the success of modern techniques that are based on deep learning. With the rise of the popularity of NLP and the availability of different forms of large-scale data, it is now even more imperative to understand the inner workings of NLP techniques and concepts, from first principles, as they find their way into real-world usage and applications that affect society at large. Building intuitions and having a solid grasp of concepts are both important for coming up with innovative techniques, improving research, and building safe, human-centered AI and NLP technologies.
In this first chapter, which is part of a series called Fundamentals of NLP, we will learn about some of the most important basic concepts that power NLP techniques used for research and for building real-world applications. These techniques include lemmatization, stemming, tokenization, and sentence segmentation, all of which are important for training efficient and effective NLP models. We also provide some exercises for you to keep practicing and exploring these ideas.
In every chapter, we will introduce the theoretical aspect and motivation of each concept covered. Then we will obtain hands-on experience by using bootstrap methods, industry-standard tools, and other open-source libraries to implement the different techniques. Along the way, we will also cover best practices, share important references, point out common mistakes to avoid when training and building NLP models, and discuss what lies ahead.
Tokenization
With any typical NLP task, one of the first steps is to tokenize your pieces of text into individual words/tokens (the process is demonstrated in the figure above), the result of which is used to create so-called vocabularies that will be used in the language model you plan to build. This is actually one of the techniques that we will use the most throughout this series, but here we stick to the basics.
Below is an example of a simple tokenizer that doesn't follow any standard. All it does is extract tokens based on a whitespace separator.
Try running the following code blocks.
## required libraries that need to be installed
%%capture
!pip install -U spacy
!pip install -U spacy-lookups-data
!python -m spacy download en_core_web_sm
## tokenizing a piece of text
doc = "I love coding and writing"
for i, w in enumerate(doc.split(" ")):
    print("Token " + str(i) + ": " + w)
All the code does is separate the sentence into individual tokens. The above simple block of code works well on the text I have provided. But typically, text is a lot noisier and more complex than the example I used. For instance, is the word "so-called" one word or two words? For such scenarios, you may need more advanced approaches for tokenization. You can consider stripping away the "-" and splitting into two tokens, or just combining them into one token, but this all depends on the problem and domain you are working on.
Another problem with our simple algorithm is that it cannot deal with extra whitespaces in the text. In addition, how do we deal with cities like "New York" and "San Francisco"?
Exercise 1: Copy the code from above, add extra whitespaces to the string value assigned to the doc variable, and identify the issue with the code. Then try to fix the issue. Hint: Use text.strip() to fix the problem.
### ENTER CODE HERE
###
Tokenization can also come in different forms. For instance, more recently a lot of state-of-the-art NLP models such as BERT make use of subword tokens, in which frequent combinations of characters also form part of the vocabulary. This helps to deal with the so-called out-of-vocabulary (OOV) problem. We will discuss this in upcoming chapters, but if you are interested in reading more about this now, check this paper.
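As a quick, hedged illustration of what subword tokenization looks like in practice, the sketch below uses the Hugging Face transformers library (an assumption on my part, it is not part of this chapter's setup) to load BERT's pretrained tokenizer; words missing from its vocabulary get split into smaller pieces marked with a "##" prefix.
## a minimal sketch of subword tokenization (assumes: pip install transformers)
from transformers import AutoTokenizer

bert_tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

## words that are not in BERT's vocabulary are split into smaller subword
## pieces, which the tokenizer marks with a "##" prefix
print(bert_tokenizer.tokenize("I love lemmatization"))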
To demonstrate how you can achieve more reliable tokenization, we are going to use spaCy, which is an impressive and robust Python library for natural language processing. In particular, we are going to use the built-in tokenizer found here.
Run the code block below.
## import the libraries
import spacy
## load the language model
nlp = spacy.load("en_core_web_sm")
## tokenization
doc = nlp("This is the so-called lemmatization")
for token in doc:
    print(token.text)
All the code does is tokenize the text based on a pre-built language model.
Try putting different running text into the nlp() part of the code above. The tokenizer is quite robust and includes a series of built-in rules that deal with exceptions and special cases, such as tokens that contain punctuation like "`", ".", "-", etc. You can even add your own rules; find out how here.
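As a small, hedged example of such a custom rule (the word "gimme" and the split chosen here are just illustrative), you can register a special case directly on the tokenizer:
## adding a custom tokenization rule as a special case (a minimal sketch)
from spacy.symbols import ORTH

## tell the tokenizer to always split "gimme" into two tokens: "gim" + "me"
nlp.tokenizer.add_special_case("gimme", [{ORTH: "gim"}, {ORTH: "me"}])

for token in nlp("gimme that book"):
    print(token.text)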
In a later chapter of the series, we will do a deep dive on tokenization and the different tools that exist out there that can simplify and speed up the process of tokenization to build vocabularies. Some of the tools we will explore are the Keras Tokenizer API and Hugging Face Tokenizer.
Lemmatization
Lemmatization is the process where we take individual tokens from a sentence and we try to reduce them to their base form. The process that makes this possible is having a vocabulary and performing morphological analysis to remove inflectional endings. The output of the lemmatization process (as shown in the figure above) is the lemma or the base form of the word. For instance, a lemmatization process reduces the inflections, "am", "are", and "is", to the base form, "be". Take a look at the figure above for a full example and try to understand what it's doing.
Lemmatization is helpful for normalizing text for text classification tasks or search engines, and a variety of other NLP tasks such as sentiment classification. It is particularly important when dealing with complex languages like Arabic and Spanish.
To show how you can achieve lemmatization and how it works, we are going to use spaCy again. Using the spaCy Lemmatizer class, we are going to convert a few words into their lemmas.
Below I show an example of how to lemmatize a sentence using spaCy. Try to run the block of code below and inspect the results.
## import the libraries
from spacy.lemmatizer import Lemmatizer
from spacy.lookups import Lookups
## lemmatization
doc = nlp(u'I love coding and writing')
for word in doc:
    print(word.text, "=>", word.lemma_)
The results above look as expected. The only lemma that looks off is the -PRON- returned for the "I" token. According to the spaCy documentation, "This is in fact expected behavior and not a bug. Unlike verbs and common nouns, there’s no clear base form of a personal pronoun. Should the lemma of “me” be “I”, or should we normalize person as well, giving “it” — or maybe “he”? spaCy’s solution is to introduce a novel symbol, -PRON-, which is used as the lemma for all personal pronouns."
Check out more about this in the spaCy documentation.
Exercise 2: Try the code above with different sentences and see if you get any unexpected results. Also, try adding punctuation and extra whitespace, which are more common in natural language. What happens?
### ENTER CODE HERE
###
We can also create our own custom lemmatizer as shown below (code adapted directly from the spaCy website):
## lookup tables
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["s", ""]]})
lemmatizer = Lemmatizer(lookups)
words_to_lemmatize = ["cats", "brings", "sings"]
for w in words_to_lemmatize:
    lemma = lemmatizer(w, "NOUN")
    print(lemma)
In the example code above, we added one lemma rule, which aims to identify plural nouns and remove the plurality, i.e. remove the "s". There are different types of rules you can add here. I encourage you to head over to the spaCy documentation to learn a bit more.
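For example, here is a minimal sketch (using the same lookup-table mechanism as above; the "-ies" rule and the word list are just illustrative) that maps plural nouns ending in "-ies" back to "-y". Words that match no rule are simply returned unchanged.
## a lookup table with an "-ies" -> "-y" noun rule (a minimal sketch)
lookups = Lookups()
lookups.add_table("lemma_rules", {"noun": [["ies", "y"]]})
lemmatizer = Lemmatizer(lookups)

for w in ["cities", "parties", "cats"]:
    ## "cities" -> "city", "parties" -> "party"; "cats" matches no rule here
    print(lemmatizer(w, "NOUN"))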
Stemming
Stemming is just a simpler version of lemmatization where we are interested in stripping the suffix at the end of the word. When stemming, we are interested in reducing the inflected or derived word to its base form. Take a look at the figure above to get some intuition about the process.
Both the stemming and the lemmatization processes involve morphological analysis, where the stems and affixes (called the morphemes) are extracted and used to reduce inflections to their base form. For instance, the word cats has two morphemes, cat and s, the cat being the stem and the s being the affix representing plurality.
spaCy doesn't support stemming so for this part we are going to use NLTK, which is another fantastic Python NLP library.
The simple example below demonstrates how you can stem words in a piece of text. Go ahead and run the code to see what happens.
from nltk.stem.snowball import SnowballStemmer
stemmer = SnowballStemmer(language='english')
doc = 'I prefer not to argue'
for token in doc.split(" "):
    print(token, '=>', stemmer.stem(token))
Notice how the stemmed version of the word "argue" is "argu". That's because we can have derived words like "argument", "arguing", and "argued".
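To see this collapsing effect more directly, the short sketch below reuses the Snowball stemmer from above on a few derived forms of "argue"; the exact outputs depend on the stemmer's rules, but related forms tend to reduce to the same stem.
## stemming several derived forms of the same word (a minimal sketch)
related_words = ["argue", "argued", "arguing", "argues"]
for w in related_words:
    print(w, "=>", stemmer.stem(w))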
Exercise 3: Try to use different sentences in the code above and observe the effect of the stemmer. By the way, there are other stemmers such as the Porter stemmer in the NLTK library. Each stemmer behaves differently so the output may vary. Feel free to try the Porter stemmer from the NLTK library and inspect the output of the different stemmers.
### ENTER CODE HERE
###
Sentence Segmentation
When dealing with text, it is common to need to break it up into its individual sentences. That is what is known as sentence segmentation: the process of obtaining the individual sentences from a text corpus. The resulting segments can then be analyzed individually with the techniques that we previously learned.
In the spaCy library, we can either use a built-in sentence segmenter (trained on statistical models) or build our own rule-based method. In fact, we will cover a few examples to demonstrate the difficulty of this problem.
Below I created a naive implementation of a sentence segmentation algorithm without using any kind of special library. You can see that my code increases in complexity (bugs included) as I start to consider more rules. This sort of bootstrapped or rule-based approach is sometimes your only option, depending on the language you are working with or the availability of linguistic resources.
Run the code below to apply a simple algorithm for sentence segmentation.
## using a simple rule-based segmenter with native python code
text = "I love coding and programming. I also love sleeping!"
current_position = 0
cursor = 0
sentences = []
for c in text:
    if c == "." or c == "!":
        sentences.append(text[current_position:cursor+1])
        current_position = cursor + 2
    cursor += 1
print(sentences)
Our sentence segmenter only segments sentences when it meets a sentence boundary, which in this case is either a "." or a "!". It's not the cleanest of code, but it shows how difficult the task can get as we are presented with richer text that includes more diverse special characters. One problem with my code is that I am not able to differentiate between abbreviations like Dr. and numbers like 0.4. You may be able to create your own complex regular expression (we will get into this in the second chapter) to deal with these special cases, but it still requires a lot of work and debugging. Luckily for us, there are libraries like spaCy and NLTK which help with this sort of preprocessing task.
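For reference, here is a minimal regex-based sketch (using Python's re module; the example text is made up) that splits on sentence-final punctuation followed by whitespace. Numbers like 0.4 survive, but abbreviations such as "Dr." are still split incorrectly.
## a naive regex-based sentence splitter (a minimal sketch)
import re

text = "Dr. Smith loves coding. I also love sleeping! The price rose 0.4 percent."

## split after ".", "!", or "?" when followed by whitespace; "0.4" is safe,
## but "Dr." is still (wrongly) treated as a sentence boundary
print(re.split(r'(?<=[.!?])\s+', text))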
Let's try the sentence segmentation provided by spaCy. Run the code below and inspect the results.
doc = nlp("I love coding and programming. I also love sleeping!")
for sent in doc.sents:
    print(sent.text)
Here is a link showing how you can create your own rule-based strategy for sentence segmentation using spaCy. This is particularly useful if you are working with domain-specific text that is full of noisy information and not as standardized as text found on a factual Wiki page or news website.
Exercise 4: For practice, try to create your own sentence segmentation algorithm using spaCy (try this link for help and ideas). At this point, I encourage you to look at the documentation, which is a huge part of learning in depth about all the concepts we will cover in this series. Research is a huge part of the learning process.
### ENTER CODE HERE
###
How to use with Machine Learning?
When you are working with textual information, it is imperative to clean your data so as to be able to train more accurate machine learning (ML) models.
One of the reasons why transformations like lemmatization and stemming are useful is for normalizing the text before you feed the output to an ML algorithm. For instance, if you are building a sentiment analysis model, how can you tell the model that "smiling" and "smile" refer to the same concept? You may require stemming if you are using TF-IDF features combined with a machine learning algorithm such as a Naive Bayes classifier. As you may suspect already, this also requires a really good tokenizer to come up with the features, especially when working on noisy pieces of text such as those generated by users on a social media site.
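As a rough, hedged sketch of how these pieces could fit together (the tiny dataset, the tokenizer, and the pipeline below are purely illustrative and rely on scikit-learn, which is not otherwise covered in this chapter), you could plug a stemming tokenizer into a TF-IDF vectorizer and feed the features to a Naive Bayes classifier:
## a minimal, illustrative sketch: stemming + TF-IDF features + Naive Bayes
## (assumes scikit-learn and nltk are installed; the tiny dataset is made up)
from nltk.stem.snowball import SnowballStemmer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

stemmer = SnowballStemmer(language="english")

def stem_tokenizer(text):
    ## naive whitespace tokenization followed by stemming,
    ## so that "smiling" and "smile" map to the same feature
    return [stemmer.stem(token) for token in text.split()]

train_texts = ["I am smiling today", "what a terrible day", "she smiled at me", "this is awful"]
train_labels = [1, 0, 1, 0]  ## 1 = positive, 0 = negative (toy labels)

model = make_pipeline(TfidfVectorizer(tokenizer=stem_tokenizer), MultinomialNB())
model.fit(train_texts, train_labels)
print(model.predict(["another terrible day"]))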
With a wide variety of NLP tasks, one of the first big steps in the NLP pipeline is to create a vocabulary that will eventually be used to determine the inputs to the model, i.e., the features. In modern NLP techniques such as pretrained language models, you need to process a text corpus that requires proper and more sophisticated sentence segmentation and tokenization, as we discussed before. We will talk more about these methods in due time. For now, the basics presented here are a good start into the world of practical NLP. Spend some time reading up on all the concepts mentioned here and take notes. I will guide you through the series on what the important parts are and provide you with relevant links, but you can also conduct your own additional research on the side and even improve this notebook.
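Coming back to the vocabulary idea mentioned above, here is a minimal sketch in plain Python (the toy corpus is made up) that builds a token-to-index mapping; mappings like this are what eventually turn tokens into the numerical inputs a model consumes.
## building a simple vocabulary (token -> index) from a toy corpus (a minimal sketch)
from collections import Counter

corpus = ["I love coding and writing", "I also love sleeping"]

## count tokens across the corpus, then assign an index to each unique token
token_counts = Counter(token for sentence in corpus for token in sentence.split())
vocabulary = {token: index for index, (token, _) in enumerate(token_counts.most_common())}

print(vocabulary)
print([vocabulary[token] for token in "I love sleeping".split()])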
Final Words and What's Next?
In this chapter we learned some fundamental concepts of NLP such as lemmatization, stemming, sentence segmentation, and tokenization. In the next chapter, we will cover topics such as word normalization, regular expressions, part of speech, and edit distance, all very important topics when working with information retrieval and NLP systems.