Data comes in all shapes and sizes: time stamps, sensor readings, images, categorical labels, and much more. A considerable amount of data is text, and text is among the most valuable kinds of data for anyone capable of understanding it. In this post on Natural Language Processing (NLP), we will use a popular library called spaCy to take on some of the most important tasks in working with text.

By the end, you will be able to use spaCy for:

  • Basic text processing and pattern matching
  • Building machine learning models with text
  • Representing text with word embeddings that numerically capture the meaning of words and documents

To understand this post fully, you’ll need some experience with machine learning. If you don’t have experience with scikit-learn, keep an eye out for my upcoming tutorial on it, but hang on for now: the parts we use here are pretty straightforward.

Natural Language Processing with spaCy

spaCy is the leading library for NLP, and it has quickly become one of the most popular Python frameworks. Most people find it intuitive, and it has excellent documentation.

Installing spaCy

You can install spaCy with the following command:

pip install spacy

Loading the models

spaCy relies on language-specific models that come in different sizes. You can load a spaCy model with spacy.load, after downloading it once with a command like python -m spacy download en_core_web_sm.

For example, here’s how you would load the English language model.

import spacy
nlp = spacy.load('en_core_web_sm')

With the model loaded, you can process text like this:

doc = nlp("Tea is healthy and calming, don't you think?")

There’s a lot you can do with the doc object you just created.

Tokenizing

To begin with natural language processing, we need to tokenize the text. A token is a unit of text in the document, such as an individual word or punctuation mark. spaCy splits contractions like “don’t” into two tokens, “do” and “n’t”. You can see the tokens by iterating through the document.

for token in doc:
    print(token)
Tea
is
healthy
and
calming
,
do
n't
you
think
?

Iterating through a document gives you all of its token objects. Each of these tokens comes with additional information. In most cases, the important attributes are token.lemma_ and token.is_stop.

Text preprocessing

There are a few types of preprocessing that can improve how we model with words. The first is “lemmatizing.” The “lemma” of a word is its base form. For example, “laugh” is the lemma of the word “laughing”. So, when you lemmatize the word “laughing”, you convert it to “laugh”.

It’s also common to remove stopwords. Stopwords are words that occur frequently in the language and don’t contain much information. English stopwords include “the”, “is”, “and”, “but”, “not”.

With a spaCy token, token.lemma_ returns the lemma, while token.is_stop returns a boolean True if the token is a stopword (and False otherwise).

print(f"Token \t\tLemma \t\tStopword".format('Token', 'Lemma', 'Stopword'))
print("-"*40)
for token in doc:
    print(f"{str(token)}\t\t{token.lemma_}\t\t{token.is_stop}")
Token 		Lemma 		Stopword
----------------------------------------
Tea		tea		False
is		be		True
healthy		healthy		False
and		and		True
calming		calm		False
,		,		False
do		do		True
n't		not		True
you		-PRON-		True
think		think		False
?		?		False

Why are lemmas and identifying stopwords important? Language data has a lot of noise mixed in with informative content. In the sentence above, the important words are tea, healthy and calming. Removing stop words might help the predictive model focus on relevant words. Lemmatizing similarly helps by combining multiple forms of the same word into one base form (“calming”, “calms”, “calmed” would all change to “calm”).

However, lemmatizing and dropping stopwords might result in your models performing worse. So you should treat this preprocessing as part of your hyperparameter optimization process.
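
As a concrete illustration, here is a minimal sketch of this preprocessing applied to the doc from earlier: keep the lemma of every token that is neither a stopword nor punctuation. Whether this actually helps is something to test during model tuning.

# A minimal sketch: keep lemmas of tokens that are neither stopwords nor punctuation
cleaned = [token.lemma_ for token in doc
           if not token.is_stop and not token.is_punct]
print(cleaned)
# Based on the table above, this leaves: ['tea', 'healthy', 'calm', 'think']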

Pattern Matching

Another common Natural Language Processing task is matching tokens or phrases within chunks of text or whole documents. You can do pattern matching with regular expressions, but spaCy’s matching capabilities tend to be easier to use.

To match individual tokens, you create a Matcher. When you want to match a list of terms, it’s easier and more efficient to use PhraseMatcher. For example, if you want to find where different smartphone models show up in some text, you can create patterns for the model names of interest. First you create the PhraseMatcher itself.

from spacy.matcher import PhraseMatcher
matcher = PhraseMatcher(nlp.vocab, attr='LOWER')

The matcher is created using the vocabulary of your model. Here we’re using the small English model you loaded earlier. Setting attr='LOWER' will match the phrases on lowercased text. This provides case insensitive matching.

Next you create a list of terms to match in the text. The phrase matcher needs the patterns as document objects. The easiest way to get these is with a list comprehension using the nlp model.

terms = ['Galaxy Note', 'iPhone 11', 'iPhone XS', 'Google Pixel']
patterns = [nlp(text) for text in terms]
matcher.add("TerminologyList", None, *patterns)

Then you create a document from the text to search and use the phrase matcher to find where the terms occur in the text.

# Borrowed from https://daringfireball.net/linked/2019/09/21/patel-11-pro
text_doc = nlp("Glowing review overall, and some really interesting side-by-side "
               "photography tests pitting the iPhone 11 Pro against the "
               "Galaxy Note 10 Plus and last year’s iPhone XS and Google Pixel 3.") 
matches = matcher(text_doc)
print(matches)
[(3766102292120407359, 17, 19), (3766102292120407359, 22, 24), (3766102292120407359, 30, 32), (3766102292120407359, 33, 35)]

Each match is a tuple containing the match id and the start and end positions of the matched phrase.

match_id, start, end = matches[0]
print(nlp.vocab.strings[match_id], text_doc[start:end])
TerminologyList iPhone 11
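
The PhraseMatcher works well for exact term lists. For the token-level Matcher mentioned above, you describe each token with attributes instead. Below is a small sketch (the pattern is made up for illustration) that matches a token spelled “iphone” in any casing followed by a number, using the same spaCy 2-style add signature as above.

from spacy.matcher import Matcher

token_matcher = Matcher(nlp.vocab)
# One dict per token: "iphone" in any casing, followed by a number-like token
pattern = [{"LOWER": "iphone"}, {"IS_DIGIT": True}]
token_matcher.add("IPHONE_MODEL", None, pattern)

for match_id, start, end in token_matcher(text_doc):
    print(nlp.vocab.strings[match_id], text_doc[start:end])
# In the review text above this should find "iPhone 11"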

Text Classification

A common task in Natural Language Processing is text classification. This is “classification” in the conventional machine learning sense, and it is applied to text. Examples include spam detection, sentiment analysis, and tagging customer queries.

Download the data required for this activity. The classifier will detect spam messages, a common functionality in most email clients.

import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/spam.csv')
spam.head(10)

   label  text
0  ham    Go until jurong point, crazy.. Available only …
1  ham    Ok lar… Joking wif u oni…
2  spam   Free entry in 2 a wkly comp to win FA Cup fina…
3  ham    U dun say so early hor… U c already then say…
4  ham    Nah I don’t think he goes to usf, he lives aro…
5  spam   FreeMsg Hey there darling it’s been 3 week’s n…
6  ham    Even my brother is not like to speak with me. …
7  ham    As per your request ‘Melle Melle (Oru Minnamin…
8  spam   WINNER!! As a valued network customer you have…
9  spam   Had your mobile 11 months or more? U R entitle…

Bag of Words

Machine learning models don’t learn from raw text data. Instead, you need to convert the text to something numeric.

The simplest common representation is a variation of one-hot encoding. You represent each document as a vector of term frequencies for each term in the vocabulary. The vocabulary is built from all the tokens (terms) in the corpus (the collection of documents).

As an example, take the sentences “Tea is life. Tea is love.” and “Tea is healthy, calming, and delicious.” as our corpus. The vocabulary then is {"tea", "is", "life", "love", "healthy", "calming", "and", "delicious"} (ignoring punctuation).

For each document, count up how many times a term occurs, and place that count in the appropriate element of a vector. The first sentence has “tea” twice, and “tea” is the first term in our vocabulary, so we put the number 2 in the first element of the vector. Our sentences as vectors then look like

$$v_1 = [2,\; 2,\; 1,\; 1,\; 0,\; 0,\; 0,\; 0]$$
$$v_2 = [1,\; 1,\; 0,\; 0,\; 1,\; 1,\; 1,\; 1]$$

This is called the bag of words representation. You can see that documents with similar terms will have similar vectors. Vocabularies frequently have tens of thousands of terms, so these vectors can be very large.
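
To make those vectors concrete, here is a tiny pure-Python sketch that builds the same counts for the two example sentences (with the vocabulary in the order listed above).

# A tiny sketch of bag-of-words counting for the two example sentences
vocab = ["tea", "is", "life", "love", "healthy", "calming", "and", "delicious"]

def bow_vector(sentence):
    # Lowercase, strip the punctuation used in the examples, and count each term
    words = sentence.lower().replace(",", "").replace(".", "").split()
    return [words.count(term) for term in vocab]

print(bow_vector("Tea is life. Tea is love."))               # [2, 2, 1, 1, 0, 0, 0, 0]
print(bow_vector("Tea is healthy, calming, and delicious.")) # [1, 1, 0, 0, 1, 1, 1, 1]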

Another common representation is TF-IDF (Term Frequency – Inverse Document Frequency). TF-IDF is similar to bag of words, except that each term count is scaled down according to how many documents in the corpus contain that term, so very common terms carry less weight. Using TF-IDF can potentially improve your models. You won’t need it here, but feel free to look it up!
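
If you do want to try it, here is a minimal sketch using scikit-learn's TfidfVectorizer on the same two sentences; it isn't used in the rest of this post.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["Tea is life. Tea is love.",
          "Tea is healthy, calming, and delicious."]

# Each column corresponds to a vocabulary term; counts are scaled by
# inverse document frequency and each row is length-normalized.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)

print(vectorizer.vocabulary_)     # term -> column index
print(tfidf.toarray().round(2))   # one row per document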

Building a Bag of Words model

Once you have your documents in a bag of words representation, you can use those vectors as input to any machine learning model. spaCy handles the bag of words conversion and building a simple linear model for you with the TextCategorizer class.

The TextCategorizer is a spaCy pipe. Pipes are classes for processing and transforming tokens. When you create a spaCy model with nlp = spacy.load('en_core_web_sm'), there are default pipes that perform part of speech tagging, entity recognition, and other transformations. When you run text through a model with doc = nlp("Some text here"), the output of the pipes is attached to the tokens in the doc object. The lemmas from token.lemma_ come from one of these pipes.
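
To see which default pipes a pretrained model ships with, you can inspect nlp.pipe_names; a quick sketch:

import spacy

# List the pipes in a pretrained model. For the small English model this
# typically includes a tagger, parser, and named entity recognizer
# (the exact names depend on the spaCy version).
pretrained = spacy.load('en_core_web_sm')
print(pretrained.pipe_names)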

You can remove or add pipes to models. What we’ll do here is create an empty model without any pipes (other than a tokenizer, since all models always have a tokenizer). Then, we’ll create a TextCategorizer pipe and add it to the empty model.

import spacy

# Create an empty model
nlp = spacy.blank("en")

# Create the TextCategorizer with exclusive classes and "bow" architecture
textcat = nlp.create_pipe(
              "textcat",
              config={
                "exclusive_classes": True,
                "architecture": "bow"})

# Add the TextCategorizer to the empty model
nlp.add_pipe(textcat)

Since the classes are either ham or spam, we set "exclusive_classes" to True. We’ve also configured it with the bag of words ("bow") architecture. spaCy provides a convolutional neural network architecture as well, but it’s more complex than you need for now.

Next we’ll add the labels to the model. Here, “ham” is the label for real messages and “spam” for spam messages.

# Add labels to text classifier
textcat.add_label("ham")
textcat.add_label("spam")

Training a Text Categorizer Model

Next, you’ll convert the labels in the data to the form TextCategorizer requires. For each document, you’ll create a dictionary of boolean values for each class.

For example, if a text is “ham”, we need a dictionary {'ham': True, 'spam': False}. The model is looking for these labels inside another dictionary with the key 'cats'.

train_texts = spam['text'].values
train_labels = [{'cats': {'ham': label == 'ham',
                          'spam': label == 'spam'}} 
                for label in spam['label']]

Then we combine the texts and labels into a single list.

train_data = list(zip(train_texts, train_labels))
train_data[:3]

[('Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...',
  {'cats': {'ham': True, 'spam': False}}),
 ('Ok lar... Joking wif u oni...', {'cats': {'ham': True, 'spam': False}}),
 ("Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's",
  {'cats': {'ham': False, 'spam': True}})]

Now you are ready to train the model. First, create an optimizer using nlp.begin_training(). spaCy uses this optimizer to update the model. In general it’s more efficient to train models in small batches. spaCy provides the minibatch function that returns a generator yielding minibatches for training. Finally, the minibatches are split into texts and labels, then used with nlp.update to update the model’s parameters.

from spacy.util import minibatch

spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

# Create the batch generator with batch size = 8
batches = minibatch(train_data, size=8)
# Iterate through minibatches
for batch in batches:
    # Each batch is a list of (text, label) but we need to
    # send separate lists for texts and labels to update().
    # This is a quick way to split a list of tuples into lists
    texts, labels = zip(*batch)
    nlp.update(texts, labels, sgd=optimizer)

This is just one training loop (or epoch) through the data. The model will typically need multiple epochs. Use another loop for more epochs, and optionally re-shuffle the training data at the beginning of each loop.

import random

random.seed(1)
spacy.util.fix_random_seed(1)
optimizer = nlp.begin_training()

losses = {}
for epoch in range(10):
    random.shuffle(train_data)
    # Create the batch generator with batch size = 8
    batches = minibatch(train_data, size=8)
    # Iterate through minibatches
    for batch in batches:
        # Each batch is a list of (text, label) but we need to
        # send separate lists for texts and labels to update().
        # This is a quick way to split a list of tuples into lists
        texts, labels = zip(*batch)
        nlp.update(texts, labels, sgd=optimizer, losses=losses)
    print(losses)
{'textcat': 0.43349911164852983}
{'textcat': 0.6496019514395925}
{'textcat': 0.7863673793854753}
{'textcat': 0.873747565239956}
{'textcat': 0.9302242411678097}
{'textcat': 0.967743759023069}
{'textcat': 0.9961505934359941}
{'textcat': 1.0150542123599622}
{'textcat': 1.029843157208506}
{'textcat': 1.0401488806910077}

Making Predictions

Now that you have a trained model, you can make predictions with the predict() method. The input text needs to be tokenized with nlp.tokenizer. Then you pass the tokens to the predict method, which returns scores. The scores are the probabilities that the input text belongs to each class.

texts = ["Are you ready for the tea party????? It's gonna be wild",
         "URGENT Reply to this message for GUARANTEED FREE TEA" ]
docs = [nlp.tokenizer(text) for text in texts]
    
# Use textcat to get the scores for each doc
textcat = nlp.get_pipe('textcat')
scores, _ = textcat.predict(docs)

print(scores)
[[9.9994671e-01 5.3249827e-05]
 [1.1798984e-02 9.8820102e-01]]

The scores are used to predict a single class or label by choosing the label with the highest probability. You get the index of the highest probability with scores.argmax, then use the index to get the label string from textcat.labels.

# From the scores, find the label with the highest score/probability
predicted_labels = scores.argmax(axis=1)
print([textcat.labels[label] for label in predicted_labels])
['ham', 'spam']

Evaluating the model is straightforward once you have the predictions. To measure the accuracy, calculate how many correct predictions are made on some test data, divided by the total number of predictions.
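
A minimal sketch of that calculation, assuming hypothetical val_texts and val_labels holding held-out texts and their 'ham'/'spam' labels (we trained on all the data above, so you’d need to set aside a test split yourself):

# A minimal accuracy sketch; `val_texts` and `val_labels` are hypothetical
# held-out texts and their 'ham'/'spam' labels.
def accuracy(nlp, textcat, val_texts, val_labels):
    docs = [nlp.tokenizer(text) for text in val_texts]
    scores, _ = textcat.predict(docs)
    predictions = [textcat.labels[i] for i in scores.argmax(axis=1)]
    correct = sum(pred == true for pred, true in zip(predictions, val_labels))
    return correct / len(val_labels)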

Word Embeddings

You know at this point that machine learning on text requires that you first represent the text numerically. So far, you’ve done this with bag of words representations. But you can usually do better with word embeddings.

Word embeddings (also called word vectors) represent each word numerically in such a way that the vector corresponds to how that word is used or what it means. Vector encodings are learned by considering the context in which the words appear. Words that appear in similar contexts will have similar vectors. For example, vectors for “leopard”, “lion”, and “tiger” will be close together, while they’ll be far away from “planet” and “castle”.

Even cooler, relations between words can be examined with mathematical operations. Subtracting the vectors for “man” and “woman” will return another vector. If you add that to the vector for “king” the result is close to the vector for “queen.”
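
Here is a small sketch of that analogy using spaCy’s own vectors; it needs the large English model introduced just below, and the exact similarity values will depend on the model.

import numpy as np
import spacy

# Requires the large model (also loaded below), since it ships with word vectors
nlp_large = spacy.load('en_core_web_lg')

king = nlp_large.vocab['king'].vector
man = nlp_large.vocab['man'].vector
woman = nlp_large.vocab['woman'].vector

result = king - man + woman

def cosine(a, b):
    return a.dot(b) / np.sqrt(a.dot(a) * b.dot(b))

# The shifted vector should land closer to "queen" than to an unrelated word
print(cosine(result, nlp_large.vocab['queen'].vector))
print(cosine(result, nlp_large.vocab['castle'].vector))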

Word vector examples

These vectors can be used as features for machine learning models. Word vectors will typically improve the performance of your models above bag of words encoding. spaCy ships with pretrained word embeddings (learned with methods such as Word2Vec). You can access them by loading a large language model like en_core_web_lg. They will then be available on tokens from the .vector attribute.

import numpy as np
import spacy

# Need to load the large model to get the vectors
nlp = spacy.load('en_core_web_lg')

# Disabling other pipes because we don't need them and it'll speed up this part a bit
text = "These vectors can be used as features for machine learning models."
with nlp.disable_pipes(*nlp.pipe_names):
    vectors = np.array([token.vector for token in  nlp(text)])

vectors.shape

(12, 300)

These are 300-dimensional vectors, with one vector for each word. However, we only have document-level labels and our models won’t be able to use the word-level embeddings. So, you need a vector representation for the entire document.

There are many ways to combine all the word vectors into a single document vector we can use for model training. A simple and surprisingly effective approach is simply averaging the vectors for each word in the document. Then, you can use these document vectors for modeling.
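
For the example sentence above, that just means taking the mean over the rows of the vectors array we computed earlier (a quick sketch):

# Average the per-token vectors from above into one 300-dimensional document vector
doc_vector = vectors.mean(axis=0)
print(doc_vector.shape)    # (300,)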

spaCy calculates the average document vector, which you can get with doc.vector. Here is an example loading the spam data and converting it to document vectors.

import pandas as pd

# Loading the spam data
# ham is the label for non-spam messages
spam = pd.read_csv('../input/nlp-course/spam.csv')

with nlp.disable_pipes(*nlp.pipe_names):
    doc_vectors = np.array([nlp(text).vector for text in spam.text])
    
doc_vectors.shape

(5572, 300)

Classification Models

With the document vectors, you can train scikit-learn models, xgboost models, or any other standard approach to modeling.

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(doc_vectors, spam.label,
                                                    test_size=0.1, random_state=1)

Here is an example using support vector machines (SVMs). Scikit-learn provides an SVM classifier called LinearSVC. It works similarly to other scikit-learn models.

from sklearn.svm import LinearSVC

# Set dual=False to speed up training; the dual formulation isn't needed since we have more samples than features
svc = LinearSVC(random_state=1, dual=False, max_iter=10000)
svc.fit(X_train, y_train)
print(f"Accuracy: {svc.score(X_test, y_test) * 100:.3f}%", )
Accuracy: 97.312%

Document Similarity

Documents with similar content generally have similar vectors. So you can find similar documents by measuring the similarity between the vectors. A common metric for this is the cosine similarity, which measures the angle between two vectors, $a$ and $b$:

$$\cos\theta = \frac{a \cdot b}{\|a\|\,\|b\|}$$

This is the dot product of $a$ and $b$, divided by the product of their magnitudes. The cosine similarity can vary between -1 and 1, corresponding to complete opposites and perfect similarity, respectively. To calculate it, you can use the metric from scikit-learn or write your own function.

def cosine_similarity(a, b):
    return a.dot(b)/np.sqrt(a.dot(a) * b.dot(b))

a = nlp("REPLY NOW FOR FREE TEA").vector
b = nlp("According to legend, Emperor Shen Nung discovered tea when leaves from a wild tree blew into his pot of boiling water.").vector
cosine_similarity(a, b)

0.7030031

I hope this post has given you a solid introduction to Natural Language Processing with spaCy.