Introduction to Named Entity Recognition using Prodigy

Alessio Lombardi
18 min read · Jun 15, 2021

Disclaimer: this tutorial is “the written version” of an excellent video tutorial by Ines Montani from Explosion.ai. Because I prefer to have a textual reference rather than to follow a long video, I took notes while reproducing the steps from the tutorial. I made sure to include all information that Ines provides, and I added some instructions from my own experience, especially on the set-up.

I hope this can be useful as a reference to others. I also included several links to the original video with specific timeframes in the course of the text.

Problem statement

Describe how online mentions of certain ingredients change over time.

E.g. garlic, onion, etc.

Challenge

Be able to discern whether a given word or phrase is actually used as an ingredient or not.

In this specific case, generally phrases are unambiguous: `garlic` will almost certainly mean the vegetable — the ingredient.

However, we need to find **all** the ingredients mentioned.

Some ingredients are common, so the annotation tool's match patterns will be able to do much of the work for us.

We use Prodigy to create a model that annotates ingredients. To do so, we will first create an initial draft model that can identify ingredients with some imprecision. We will then use Prodigy to correct the model's mistakes and give it more data points for labels it may miss; in other words, we aid the model with some manual annotation.

In total, the manual annotation time spent on this exercise in the original version by Ines was about 2.5 hours. You can spend less, but that will reduce the precision of the model. I only spent ~1h doing labelling.

This is a very high level of efficiency for this kind of task. You can think of running a quick modelling exercise in the morning and having some initial results ready to present in the afternoon, enabling quick iterations and avoiding the need to involve more people, meetings, etc.

Dataset

Reddit comments corpus; specifically, recent years’ comments from the r/Cooking subreddit.

We will go through a sample of the comments and we will be able to highlight ingredients whenever they are mentioned.

Installation prerequisites

We will use Prodigy, Spacy and Sense2vec. Making the combination of these work today (09/06/2021) requires:

  • pip install prodigy ==> version out currently is 1.10.8. The versions 1.10.x are compatible only with spacy 2.x.
  • pip install -U spacy==2.3.7 ==> version out now would be 3.x+, incompatible with prodigy.
  • pip install sense2vec==1.0.3 ==> version out now is 2.0.0, incompatible with spacy 2.x, so we need to stay on 1.0.3.

Steps

  1. Create a phrase list and related match patterns for ingredients.
  2. Label all ingredients in a sample of texts with the help of the match patterns.
  3. Train and evaluate an initial, draft model. This is helpful to see how things are going.
  4. Label more examples by correcting the model’s prediction.
  5. Train a new model with improved accuracy.
  6. Run the model over 2m+ comments from Reddit.
  7. Data visualization — I will not cover this, see the original video instead.

1) Create a phrase list and related match patterns for ingredients

We’ll use a word vectors model, previously trained on Reddit comments, which includes vectors for multi-word expressions (e.g. `cottage cheese`).

  • Download the sense2vec model: https://github.com/explosion/sense2vec (s2v_reddit_2015_md)
  • Extract the content. The compressed file contains a series of subfolders. Navigate down until you find the folder containing several files, including a “cfg” file. Extract this content to a folder of your choice, which I will name `s2v_reddit_2015_md`. An example extraction command is shown below.
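For reference, assuming you downloaded the `.tar.gz` archive from the sense2vec releases page, extraction on Linux/macOS could look like:

tar -xzvf s2v_reddit_2015_md.tar.gz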

Prodigy works with Python functions called “recipes”.

We will use the `sense2vec.teach` recipe, which is included in Prodigy.

python -m prodigy sense2vec.teach food_terms path/to/s2v_reddit_2015_md --seeds "garlic, avocado, cottage cheese, olive oil, cumin, chicken breast, beef, iceberg lettuce"

  • `sense2vec.teach` is the prodigy recipe we will use here.
  • `food_terms` is the dataset name. This dataset will bundle several annotations together. We will later improve this dataset by adding to it, and we will export from it.
  • `path/to/s2v_reddit_2015_md` is the path to the unpacked sense2vec trained model.
  • Finally, the `--seeds "some strings"` argument is used to specify some initial seed phrases. These will be used to find other similar phrases in the vectors.

Pressing enter will initialize Prodigy and spin up its local web app.

Navigate to http://localhost:8080/ so we can start annotating.

The interface already shows us a new ingredient that was not among the ones suggested in the `seeds` argument: `spinach` is suggested, and with a high similarity score. This is an indication that we are in the correct vector space, and things are going well.

We can now choose “accept” (keyboard shortcut: `A`) or “reject” (keyboard shortcut: `X`) on the suggested term.
Because we think that `spinach` is a good match, we accept.

Prodigy will then progress and show more terms.

Other suggested matches that followed for me included `feta cheese`, `parmesan cheese`, etc., which are all good and show that we are getting suggestions for multi-word expressions.

At some point, we may get `avacado` as a suggestion.

This is an indication that the misspelling `avacado` instead of `avocado` is common enough that it made the cut and got its own vector. This means that the word is relevant, despite it not being an actual, correct word. In other words, when people type `avacado`, there is a high probability that they actually mean the avocado ingredient. It is therefore advisable to accept this misspelling and have it in our word list and match patterns.

An example of a term we want to reject is `steamed broccoli`: only `broccoli` is the ingredient, so the full phrase should be rejected.

After some words (I got to 289 total proposed words), you will get a “No tasks available” screen. In the left column you can see the total accepted/rejected/ignored words.

We can now move on and start creating our patterns.

Alternatively, we could restart the server with new seed terms if we think it would be useful.

Let’s hit the save button (or press `Ctrl + S`) to save the annotations in the database.

Returning to the terminal, if we exit the WebApp server (press `Ctrl + C`), Prodigy will display a message about the annotations saved in the database.

We can now reuse this dataset to create our match patterns.

Create match patterns related to the phrase list

To create the match patterns, we can use the Prodigy built-in recipe `terms.to-patterns`:

python -m prodigy terms.to-patterns food_terms --label INGRED --spacy-model blank:en > ./food_patterns.jsonl
  • `food_terms`: the dataset we want to use.
  • `--label`: the entity label we want to use, shortened to INGRED to take less space.
  • `--spacy-model`: we’ll use `blank:en`, a Spacy model that is simply a blank English tokenizer.
  • The output is redirected (`>`) to a “food_patterns.jsonl” file (newline-delimited JSON). This file will contain our patterns.

The match patterns will look like:

{"label":"INGRED","pattern":[{"lower":"cottage"},{"lower":"cheese"}]}

Some examples:
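For illustration, a few lines from my `food_patterns.jsonl` looked roughly like this (the exact contents will depend on which terms you accepted during the `sense2vec.teach` session):

```
{"label":"INGRED","pattern":[{"lower":"garlic"}]}
{"label":"INGRED","pattern":[{"lower":"avacado"}]}
{"label":"INGRED","pattern":[{"lower":"olive"},{"lower":"oil"}]}
{"label":"INGRED","pattern":[{"lower":"feta"},{"lower":"cheese"}]}
```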

We are now ready to annotate the data.

2) Data annotation

We want to use the match patterns to label all ingredients in a sample of texts.

The sample of texts can be downloaded from the links provided with the original tutorial; in particular, I used the `reddit_r_cooking_sample.jsonl` file.

Here is an example of an entry from this file:

```

{"text":"What is a stick blender","meta":{"section":"Cooking","utc":1508084693}}

```

The `utc` timestamp will be useful to see how mentions of ingredients in Reddit comments change over time.
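Incidentally, the `utc` field is a plain Unix timestamp, so it is easy to bucket comments by month later on. A minimal Python check (using the value from the example entry above):

```
from datetime import datetime, timezone

ts = 1508084693  # "utc" value from the sample comment above
month = datetime.fromtimestamp(ts, tz=timezone.utc).strftime("%Y-%m")
print(month)  # 2017-10
```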

To start annotating, we have different choices. We can do it via Prodigy's `ner.manual` recipe:

python -m prodigy ner.manual food_annotations blank:en ./reddit_r_cooking_sample.jsonl --label INGRED --patterns food_patterns.jsonl
  • `food_annotations` is the name of our annotations dataset.
  • `blank:en` is, as before, a Spacy model that is simply a blank English tokenizer.
  • `--label` is the label that we want to assign to the entity; obviously the same as the pattern match label.
    If we have multiple labels, we can also pass in a comma-separated list here.
  • `--patterns`: we provide the `food_patterns.jsonl` file containing the match patterns that we just created.

Once you've launched this, a new Prodigy WebApp for annotation will spin up.

If a text contains a match, Prodigy will automatically highlight it for us.

For example, here we already have three matches:

This WebApp exposes the typical functionality that people think of when you say “annotation”: it allows you to click, highlight and assign annotations to spans of text, but also something more. Because the text has already been tokenized, the app knows where a word starts and ends, so you don’t have to select the exact text span: by simply clicking on a part of the text, the selection snaps to the token boundaries. The token boundaries are also what the model is going to predict. You can try this by selecting `mock duck` in the first sentence without being very precise (e.g. start selecting from `ck` to `du`). This way you spend less time highlighting and you can go faster.

Let’s select the missing entities here:

Then hit “accept” and move on to the next text.

Annotation policy and Label scheme

These are two things that are not often mentioned, but are very important to the annotation exercise.

We need to define a precise annotation policy for labelling. In our case, the policy could be something like

“_annotate exclusively food terms that are actually used to mean ingredients_”.

Our label scheme is pretty simple and includes only INGRED, but in general it may have more labels.

Once set, the annotation policy and label scheme should be kept in mind during the annotation exercise, and we should strive for consistency. It’s actually very common to start the annotation exercise and realize mid-way that, for example, we limited ourselves too much with too few labels and we need more; in such cases, the best thing is to review the annotation policy/label scheme and start again.

Here is an example of a potential **ignore**:

Here, `curry` is an ingredient, but it’s used ambiguously: it appears to be used as part of a dish name more than as a standalone ingredient.

We can go ahead and press “skip” (spacebar) and avoid using this example if we are not sure.

It’s better to continue fast and keep our flow than stop and think too long. Because we have a large dataset, losing an example won’t make much difference.

Ambiguous spans may be reviewed by domain experts. If multiple experts disagree on what to label, that’s a good indication that the text should be ignored.

We should also skip text samples that consist only of single words or links, since they add little value for training.

Some text spans will not have any ingredient mentioned:

In this case we can just go on and **accept** it, because it’s already correct.

Showing the model examples of what is NOT an entity is just as important as showing it what is an entity.

This example could be a **rejection**:

Here, the ingredient `salt` appears, but because a dash character is attached to it, the tokenizer will not be able to capture the actual word.

For this reason, it’s better to reject this text span. This way, we differentiate it clearly from the text spans that we’ve ignored.

Sometimes it makes sense to go back to the rejected examples after the exercise is finished to see if there are common problems that can be fixed by tweaking the tokenization rules; in particular, if you are dealing with unusual punctuation or many special cases.
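As an illustration (this is my own sketch, not part of the original tutorial), this is how you could inspect the tokenization of a text like the one above and add a trailing hyphen as a suffix rule in Spacy 2.x:

```
import spacy
from spacy.util import compile_suffix_regex

nlp = spacy.blank("en")
print([t.text for t in nlp("needs more salt- trust me")])
# with the default rules, "salt-" stays a single token

# Add "-" to the suffix rules so a trailing hyphen is split off as its own token
suffixes = nlp.Defaults.suffixes + (r"-",)
nlp.tokenizer.suffix_search = compile_suffix_regex(suffixes).search
print([t.text for t in nlp("needs more salt- trust me")])
# "salt" and "-" should now be separate tokens
```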

An example of an ambiguous text span:

Here it’s probably safe to say that “Guinness extra stout”, “Guinness foreign extra stout”, “stouts” and “smoked porters” are used as ingredients, so we should label them.

In the bottom right corner, you can see a list of “pattern IDs”, which are the line numbers of the patterns that matched. These are useful after the exercise is finished, in case you want to go back to a particular span and check how a match was created by reading the context.

How many examples to label?

Ines here suggests that the annotation process for this dataset could go for about 500 text spans analysed, which should give us about 400 “accepted” ones to train our model on. I decided to stop at around 250 as it takes some time.

Also, we should remember that we need labelled examples for evaluating the model too, so we should account for some “extra samples” in addition to the ones strictly needed for training.

In general, a few hundred examples can be a good number for training, especially if we use some additional technique like transfer learning later.

Once the exercise is done, we can hit “save” again, return to the command line and exit the server (Ctrl + C).

3) Train an initial model

Train and evaluate an initial, draft model. This is helpful to see how things are going, and we will be able to build on top of it.

In general, it’s important to have a tight “feedback loop” with the annotation: we should be sure that the model is learning the right thing, otherwise we need to go back and revise our annotation policy/label scheme. It’s important to validate our ideas early on, so we don’t waste time on training and improving models that are not going to work well.

This initial model could be something that we use to suggest entities, and all we do manually is correct its mistakes. This gives us a good idea of the model behaviour. It also improves the efficiency of collecting more data, because we have a model that already gives us something.

Token-to-vector pretrained layer

To make the most of our small training dataset, we want to use a pre-trained token-to-vector layer, which we will use through Prodigy to initialize our model with pretrained representations.

We can use Spacy’s `pretrain` command (https://spacy.io/api/cli#pretrain) to pretrain weights on the Reddit corpus.

The idea is similar to the language model pretraining that was popularized by ELMo, BERT, ULMFiT and so on.

The only difference here is that we are not training the model to predict the exact next word, but an approximation: in this case, the word’s own word vector. This makes the artifact that we are training much smaller, and it also makes the runtime speed much faster. Ines did this pretraining (it took ~8 hours on GPU), and the resulting weights file, tok2vec_cd8_model289.bin, is linked from the original tutorial.

This layer was trained with the `en_vectors_web_lg` vectors that are available with Spacy; the same vectors package will be needed for the model training, as we’ll see below.
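For reference, the pretraining command in Spacy 2.x takes the raw texts, the vectors package and an output directory as positional arguments. A sketch of such a run (the `reddit_comments.jsonl` path is a placeholder; see the Spacy docs and the video for the exact settings Ines used) could be:

python -m spacy pretrain ./reddit_comments.jsonl en_vectors_web_lg ./pretrained_tok2vec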

Model training

Once we have the pretrained token-to-vector model, we can proceed to training our model.

Prodigy gives us a “train” recipe that is a relatively thin wrapper around Spacy’s training API, specifically optimized to run quick experiments and work with existing Prodigy datasets and annotations.

The `train` recipe command can be issued as follows:

python -m prodigy train ner food_annotations en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./tmp_model --eval-split 0.2
  • `ner` = Named Entity Recognition, the component that we want to train.
  • `food_annotations` = our annotation dataset, which will be used for training.
  • `en_vectors_web_lg` = the base model. We need to use the same vectors that we used for pretraining the token-to-vector layer, otherwise it’s not going to work. So we specify the large English vectors package that can be downloaded with Spacy (`python -m spacy download en_vectors_web_lg`). See also the differences with en_core_web_lg.
  • `--init-tok2vec ./somefolder` = the path to the pretrained token-to-vector weights, i.e. the output of `spacy pretrain`. This pretrained model is used to initialize our model.
  • `--output ./tmp_model` = the resulting model files will be put here.
  • `--eval-split 0.2` = the percentage of examples to hold back for evaluation.
    We choose 20% here, which is generally just enough; with a dataset this small, however, the evaluation set is tiny and the evaluation is not going to be very stable. In other words, the evaluation on this 20% split is not the result you should be reporting in a paper.

On a fairly weak CPU, this training took less than 5 minutes.

The results shown by Ines in her video were obtained with twice as many data points (~500) as my version.

After the training has completed, we can make some observations:

  • The accuracy (precision/recall/f-score) was progressively increasing.
  • The model looks internally consistent, which is a good sign. It means that our idea is probably valid and worth pursuing. However, more analytics may be useful at this point to understand if it’s worth our efforts, see below.
  • A final summary per label is shown. This is useful when you are training multiple labels at the same time, to compare their results. This summary score is what we are aiming to beat later, after we’ve collected more annotations.
  • The model that gets saved is the one with the best F-score.

Is it worth improving this model with more data? Prodigy’s `train-curve`

To see whether the model is worth exploring further with additional data, we can use Prodigy’s `train-curve` recipe.

This will run the training several times with different amounts of data, for example 4 times with growing data size: 25%, 50%, 75% and 100% of the training data. This gives a good indication of whether supplying more data will improve the model.

The command is almost the same as the `train` recipe’s one:

python -m prodigy train-curve ner food_annotations en_vectors_web_lg --init-tok2vec path\to\tok2vec_cd8_model289.bin --eval-split 0.2

In my case, I had the following results:

If the difference in accuracy between increasing sample sizes is positive, it’s a good sign; otherwise the model probably does not make much sense. In particular, we want to look at the last increase (between 75% and 100%): if there is an improvement there, it makes sense to supply more data *of the same type* as the data we have already supplied. If the last segment shows a negative difference in accuracy, but all the others are positive, then maybe we could improve the model by supplying a type of data *different from the one supplied so far*.

This is noticeably different from what Ines shows in her video. We should note that I had only ~250 samples, whereas Ines had >500; at this size, my dataset produces results that are hard to trust. Ines’ results are shown in her video.

4) Use the model to perform labelling and correct its mistakes

Given the result of train-curve, we can deem ourselves satisfied with our model, so we can progress to the next step and use this model to do the labelling for us; we will try to improve the model only by telling it when it is wrong.

Prodigy’s recipe for that is called `ner.correct`:

python -m prodigy ner.correct food_annotations_correct path\to\tmp_model path\to\reddit_r_cooking_sample.jsonl --label INGRED --exclude food_annotations
  • `food_annotations_correct`: the name of the dataset we want to save the automated annotations to. It’s a good idea to pick a different name from the previously annotated dataset. This is true in general for different experiments: it’s always good to keep separate datasets for different explorations, because it’s easy to merge annotations together if we need to; vice versa, it’s hard to separate them, and it’s harder to start over if things go wrong.
  • `path/to/tmp_model`: the path to our draft model.
  • `path/to/reddit_r_cooking_sample.jsonl`: we need to pass in the input texts again.
  • `--label INGRED`: we also need to pass in the label that we want the model to annotate.
  • `--exclude food_annotations`: because we are annotating the same input data that we have partially annotated ourselves, this asks Prodigy to skip the texts that we have already annotated before. If an example is already present in food_annotations, we will not see it again when launching `ner.correct`.

Launching the command again opens Prodigy’s annotation WebApp. This time, the labels that we see are exclusively predicted by the model.

We can see that it already performs quite well. If something is missing, we can just add it, as we did before with the pattern matches. If it predicts a wrong span, we can click on it and remove it; for example:

Here, “Mapo tofu” is actually the name of a dish. The model correctly identifies the second “tofu” as an ingredient, but the first one should be removed.

During this exercise, we may see a lot of the same entities. We should be able to see quite a few ingredients that weren’t in the training data, which is good: we want the model to be able to generalize based on the examples that we show it. We also want to find similar ingredients if they are mentioned in similar contexts.

I had a strange behaviour here: the same texts started looping after about 30 different texts. I kept annotating until I reached about 90 samples, but they really ended up being ~30 samples checked 3 times. Perhaps this is due to the limited number of data points I have, although there are certainly more than 30. Anyway, Ines suggests getting to ~500 samples in total again; I did about ~100.

Once done, hit the save button and exit the server.

5) Train a new model with improved accuracy

We now have 2 datasets with several manual annotations each, depending on how many you have done. In my case, I did ~250 for the first draft model and added about ~100 in the “correction” phase (step 4). This means that our final model can make use of both datasets, for a total of ~350 data points. These are not very many; in her video, Ines had ~1000 total (500+500).

We will now train a model and save it in a separate file.

python -m prodigy train ner food_annotations,food_annotations_correct en_vectors_web_lg --init-tok2vec ./tok2vec_cd8_model289.bin --output ./food_model --eval-split 0.2 --n-iter 20
  • `food_annotations,food_annotations_correct`: we now pass in the names of both datasets. Do not leave a space after the comma or they will not be recognised.
  • We use the same Spacy base model `en_vectors_web_lg` and the same pretrained token-to-vector weights (`--init-tok2vec ./somefolder`) as before.
  • `--output ./food_model`: our model files will be put in this folder, which should be different from our draft model’s folder. It’s always better to keep them separate rather than update the existing draft model, as it is easier to go back and start clean if needed.
  • `--n-iter 20`: in this case, we want to do more iterations (default is 10). This can be useful as the dataset grows larger, to make sure that we don’t miss potentially better results.

This time, my results improved over the draft model’s. As a comparison, the results shown by Ines in her video were obtained using many more data points (~1000, against my ~350).

In my case, some metrics increased more significantly than others:

  • Precision: 60% → 60.2%; barely any improvement.
  • Recall: 69.23 → 87.28%; good improvement.
  • F-Score: 64.28 → 84.95%; good improvement.

Ines’ results were overall better, as she included more data points:

  • Precision: 75.90% → 85.74%
  • Recall: 75.90% → 87.28%
  • F-Score: 75.90% → 84.95%

This model is now our final model. We can run it over large datasets to extract Named Entities corresponding to food ingredients. All models produced by Prodigy are Spacy models, so they can be loaded and used with spacy commands.
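As a quick sanity check, here is a minimal sketch of loading and using the final model (the path matches the `--output` folder used above; the example sentence and the exact entities found are illustrative):

```
import spacy

nlp = spacy.load("./food_model")
doc = nlp("I marinated the chicken breast in olive oil, garlic and cumin.")
print([(ent.text, ent.label_) for ent in doc.ents])
# ideally something like [('chicken breast', 'INGRED'), ('olive oil', 'INGRED'), ('garlic', 'INGRED'), ('cumin', 'INGRED')]
```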

6) and 7) Run the model and Data visualization

The video tutorial then continues by running the model to extract ingredients as named entities from a large dataset; we then want to compute their counts over time. The final objective is to use these counts to make an animated bar chart that displays the evolution of food-ingredient mentions over time, using the 7-year Reddit comments dataset.

Again, all models produced by Prodigy are Spacy models, so they can be loaded and used with spacy commands. The script that Ines used is shown in the video; its important parts are the following (a rough sketch follows the list below):

  • line 8, `nlp = spacy.load(SPACY_MODEL)`: replace SPACY_MODEL with the path to your final model directory.
  • line 9: the Reddit comments file is read as simple newline-delimited JSON.
  • line 12: `nlp.pipe(...)`: Spacy’s pipeline tool is used to process the texts efficiently in batches.
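Since the embedded script does not render here, below is a rough sketch of what such a processing script could look like. This is my own reconstruction under assumptions (file names, the per-month counting), not Ines’ exact code:

```
import json
from collections import Counter
from datetime import datetime, timezone

import spacy

SPACY_MODEL = "./food_model"               # path to the final model trained above
COMMENTS_FILE = "./reddit_comments.jsonl"  # placeholder path to the full comments dump

nlp = spacy.load(SPACY_MODEL)

# Read the newline-delimited JSON file: one comment per line
with open(COMMENTS_FILE, encoding="utf8") as f:
    records = [json.loads(line) for line in f]

# Pair each text with its metadata so the timestamp stays attached
texts = [(r["text"], r["meta"]) for r in records]

# Count ingredient mentions per month using the "utc" timestamp in the metadata
counts = Counter()
for doc, meta in nlp.pipe(texts, as_tuples=True):
    month = datetime.fromtimestamp(meta["utc"], tz=timezone.utc).strftime("%Y-%m")
    for ent in doc.ents:
        counts[(month, ent.text.lower())] += 1

print(counts.most_common(20))
```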

Please see the original tutorial video for details on the data visualization.
