
How to import OpenAI NER pre-annotations

This tutorial explains how to use a powerful OpenAI Large Language Model (LLM) to generate pre-annotations, which will then be imported into a Named Entity Recognition (NER) Kili project.

Nowadays, LLMs are capable of performing many NLP tasks, including NER, without being explicitly trained on them; this capability is known as zero-shot learning.

By using a well-designed prompt, we can generate high-quality pre-annotations for the NER task, as demonstrated in the pre-annotations quality evaluation section of this tutorial.

In this tutorial, we will:

  • Load the CoNLL2003 dataset from the Hugging Face datasets library.
  • Learn how to use the OpenAI API to generate pre-annotations.
  • Import data and labels into a NER project on Kili.
  • Evaluate the quality of the generated pre-annotations.

Setup

Let's start this tutorial by installing the packages we will need later on.

%pip install kili datasets evaluate ipywidgets openai scikit-learn numpy rich
import getpass
import json
import os
from collections import defaultdict

import numpy as np
import openai
from rich.console import Console
from rich.table import Table

Data preparation

In this tutorial, we will use the CoNLL2003 dataset from the Hugging Face Hub. This dataset contains more than 10,000 sentences annotated with named entities.

from datasets import load_dataset

To speed up the process, we will use a limited number of samples. We will also remove sentences that do not contain enough words.

MAX_DATAPOINTS = 20
MIN_NB_TOKENS_PER_SENTENCE = 9
dataset = load_dataset("conll2003", split="train").filter(
    lambda datapoint: len(datapoint["tokens"]) >= MIN_NB_TOKENS_PER_SENTENCE
)
dataset = dataset.select(range(MAX_DATAPOINTS))
print(dataset)
Dataset({
    features: ['id', 'tokens', 'pos_tags', 'chunk_tags', 'ner_tags'],
    num_rows: 20
})
for i in range(3):
    print(dataset[i])
{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0]}
{'id': '3', 'tokens': ['The', 'European', 'Commission', 'said', 'on', 'Thursday', 'it', 'disagreed', 'with', 'German', 'advice', 'to', 'consumers', 'to', 'shun', 'British', 'lamb', 'until', 'scientists', 'determine', 'whether', 'mad', 'cow', 'disease', 'can', 'be', 'transmitted', 'to', 'sheep', '.'], 'pos_tags': [12, 22, 22, 38, 15, 22, 28, 38, 15, 16, 21, 35, 24, 35, 37, 16, 21, 15, 24, 41, 15, 16, 21, 21, 20, 37, 40, 35, 21, 7], 'chunk_tags': [11, 12, 12, 21, 13, 11, 11, 21, 13, 11, 12, 13, 11, 21, 22, 11, 12, 17, 11, 21, 17, 11, 12, 12, 21, 22, 22, 13, 11, 0], 'ner_tags': [0, 3, 4, 0, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 7, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]}
{'id': '4', 'tokens': ['Germany', "'s", 'representative', 'to', 'the', 'European', 'Union', "'s", 'veterinary', 'committee', 'Werner', 'Zwingmann', 'said', 'on', 'Wednesday', 'consumers', 'should', 'buy', 'sheepmeat', 'from', 'countries', 'other', 'than', 'Britain', 'until', 'the', 'scientific', 'advice', 'was', 'clearer', '.'], 'pos_tags': [22, 27, 21, 35, 12, 22, 22, 27, 16, 21, 22, 22, 38, 15, 22, 24, 20, 37, 21, 15, 24, 16, 15, 22, 15, 12, 16, 21, 38, 17, 7], 'chunk_tags': [11, 11, 12, 13, 11, 12, 12, 11, 12, 12, 12, 12, 21, 13, 11, 12, 21, 22, 11, 13, 11, 1, 13, 11, 17, 11, 12, 12, 21, 1, 0], 'ner_tags': [5, 0, 0, 0, 0, 3, 4, 0, 0, 0, 1, 2, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 5, 0, 0, 0, 0, 0, 0, 0]}

Here is the meaning of each feature in the dataset:

  • id: A unique identifier for each sentence in the dataset.
  • tokens: The tokens (words or punctuation marks) in a sentence.
  • pos_tags: Part-of-speech tags for each token in the sentence. Part-of-speech tagging assigns each word a tag that indicates its part of speech (e.g., noun, verb, adjective).
  • chunk_tags: Chunking tags for each token in the sentence. Chunking is the process of grouping words into meaningful phrases based on their syntactic structure.
  • ner_tags: Named Entity Recognition (NER) tags for each token in the sentence. NER is the task of identifying named entities in text and classifying them into pre-defined categories such as person, organization, location, etc.
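
The integer values in ner_tags can be decoded back into tag names through the dataset's ClassLabel feature. Here is a quick sketch (relying on the standard ClassLabel API of the datasets library):

ner_feature = dataset.features["ner_tags"].feature
print(ner_feature.names)
# ['O', 'B-PER', 'I-PER', 'B-ORG', 'I-ORG', 'B-LOC', 'I-LOC', 'B-MISC', 'I-MISC']
print([ner_feature.int2str(tag) for tag in dataset[0]["ner_tags"]])
# ['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O']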

The sentences are split into tokens. We can regroup the tokens for later use:

# Undo the extra spaces that " ".join() introduces around punctuation and clitics
fix_joined_tokens_map = {" .": ".", "( ": "(", " )": ")", " 's ": "'s ", "s ' ": "s' "}

sentence_column = []
for datapoint in dataset:
    sentence = " ".join(datapoint["tokens"])
    for before, after in fix_joined_tokens_map.items():
        sentence = sentence.replace(before, after)
    sentence_column.append(sentence)

dataset = dataset.add_column("sentence", sentence_column)
print(dataset[0])
{'id': '0', 'tokens': ['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'], 'pos_tags': [22, 42, 16, 21, 35, 37, 16, 21, 7], 'chunk_tags': [11, 21, 11, 12, 21, 22, 11, 12, 0], 'ner_tags': [3, 0, 7, 0, 0, 0, 7, 0, 0], 'sentence': 'EU rejects German call to boycott British lamb.'}
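
This detokenization map is heuristic, so a quick sanity check is worthwhile: since we only remove spaces, every original token should remain a substring of the rebuilt sentence (this will matter later, when we search for entity offsets). A minimal check:

# The map above only removes spaces, so every token must still be
# findable as a substring of the rebuilt sentence.
for datapoint in dataset:
    for token in datapoint["tokens"]:
        assert token in datapoint["sentence"], (token, datapoint["sentence"])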
NER_TAGS_ONTOLOGY = {
    "O": 0,
    "B-PERSON": 1,
    "I-PERSON": 2,
    "B-ORGANIZATION": 3,
    "I-ORGANIZATION": 4,
    "B-LOCATION": 5,
    "I-LOCATION": 6,
    "B-MISCELLANEOUS": 7,
    "I-MISCELLANEOUS": 8,
}

NER_TAGS_ONTOLOGY is a dictionary that maps the named entity tags in the CoNLL2003 dataset to integer labels. Here is the meaning of each key-value pair in the dictionary:

  • O: Represents the tag "O" which means that the token is not part of a named entity.
  • B-PERSON: Represents the beginning of a person entity.
  • I-PERSON: Represents a token inside a person entity.
  • B-ORGANIZATION: Represents the beginning of an organization entity.
  • I-ORGANIZATION: Represents a token inside an organization entity.
  • B-LOCATION: Represents the beginning of a location entity.
  • I-LOCATION: Represents a token inside a location entity.
  • B-MISCELLANEOUS: Represents the beginning of a miscellaneous entity.
  • I-MISCELLANEOUS: Represents a token inside a miscellaneous entity.

During the training of a NER model, the entity names are converted to integer labels using such a dictionary. Note that the integer values match the dataset's original ner_tags, which use the abbreviated names (B-PER, B-ORG, etc.).
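
For example, the two-token person entity "Werner Zwingmann" from the third sample above is encoded like this:

# "Werner" opens the entity (B-) and "Zwingmann" continues it (I-):
print([NER_TAGS_ONTOLOGY[tag] for tag in ["B-PERSON", "I-PERSON"]])  # [1, 2]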

Connect to the OpenAI API

Let's use the OpenAI API to get the pre-annotations for our dataset.

if "OPENAI_API_KEY" in os.environ:
    OPENAI_API_KEY = os.environ["OPENAI_API_KEY"]
else:
    OPENAI_API_KEY = getpass.getpass("Please enter your OpenAI API key: ")
openai.api_key = OPENAI_API_KEY

We can now define the parameters that will be used when querying the OpenAI model:

  • model: the model used to generate the pre-annotations. The full list of available models can be found in the OpenAI API documentation.
  • temperature: the sampling temperature, between 0 and 2. Higher values like 0.8 make the output more random, while lower values like 0.2 make it more focused and deterministic. We set it to 0 to make the pre-annotations as deterministic as possible.
  • max_tokens: the maximum number of tokens to generate. It should be between 1 and 4096, depending on the model.
openai_query_params = {"model": "text-davinci-003", "temperature": 0, "max_tokens": 1024}


def ask_openai(prompt: str, openai_query_params=openai_query_params) -> str:
    response = openai.Completion.create(
        prompt=prompt,
        **openai_query_params,
    )
    return response["choices"][0]["text"]
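
API calls can fail transiently (rate limits, timeouts). If you run into such errors, a small retry wrapper with exponential backoff can help. Here is a minimal sketch, assuming the pre-1.0 openai package used in this tutorial, where failures raise subclasses of openai.error.OpenAIError:

import time


def ask_openai_with_retry(prompt: str, max_retries: int = 3) -> str:
    # Retry with exponential backoff (1s, 2s, 4s, ...) on transient API errors.
    for attempt in range(max_retries):
        try:
            return ask_openai(prompt)
        except openai.error.OpenAIError:
            if attempt == max_retries - 1:
                raise
            time.sleep(2**attempt)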

Let's test the API with a simple prompt:

print(ask_openai("Hello, are you here?"))
Yes, I am here. How can I help you?

Prompt design

To get pre-annotations for our dataset, we need to create a prompt that tells the model what to do:

base_prompt = """In the sentence below, give me the list of:
- organization named entity
- location named entity
- person named entity
- miscellaneous named entity.
Format the output in json with the following keys:
- ORGANIZATION for organization named entity
- LOCATION for location named entity
- PERSON for person named entity
- MISCELLANEOUS for miscellaneous named entity.
Sentence below:
"""

Let's see if the model understands the prompt well on a simple example:

test_sentence = (
    "Elon Musk is the CEO of Tesla and SpaceX. He was born in South Africa and now lives in the"
    " USA. He is one of the founders of OpenAI."
)
print(ask_openai(base_prompt + test_sentence))
{
  "ORGANIZATION": ["Tesla", "SpaceX", "OpenAI"],
  "LOCATION": ["South Africa", "USA"],
  "PERSON": ["Elon Musk"],
  "MISCELLANEOUS": []
}

Looks really good! Let's now process all sentences in our dataset with the previous prompt.

Create the pre-annotations

In the code below, we will use the OpenAI API to get the pre-annotations for each sentence in our dataset.

openai_answers = []
for datapoint in dataset:
    sentence = datapoint["sentence"]
    answer = ask_openai(base_prompt + sentence)
    try:
        answer_json = json.loads(answer)
    except json.JSONDecodeError:
        print(f"Wrong json formatting:\n{answer}")
        answer_json = {"ORGANIZATION": [], "LOCATION": [], "PERSON": [], "MISCELLANEOUS": []}
    openai_answers.append(answer_json)
print(openai_answers[:3])
[{'ORGANIZATION': ['EU', 'German'], 'LOCATION': ['British'], 'PERSON': [], 'MISCELLANEOUS': ['lamb']}, {'ORGANIZATION': ['European Commission'], 'LOCATION': ['German', 'British'], 'PERSON': [], 'MISCELLANEOUS': ['mad cow disease']}, {'ORGANIZATION': ["European Union's veterinary committee"], 'LOCATION': ['Germany', 'Britain'], 'PERSON': ['Werner Zwingmann'], 'MISCELLANEOUS': []}]

We need to sanitize the JSON answers to make sure that all values are lists:

for i, _ in enumerate(openai_answers):
    json_dict = openai_answers[i]
    for category in json_dict:
        if isinstance(json_dict[category], str):
            json_dict[category] = [json_dict[category]]
        elif isinstance(json_dict[category], list):
            continue
        else:
            print(f"Unknown value type '{json_dict[category]}' for value '{json_dict[category]}'")
            json_dict[category] = []
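
If you need the same cleanup elsewhere, the logic also fits in a small standalone helper (a hypothetical sanitize_answer, not used in the rest of this tutorial):

def sanitize_answer(answer_json: dict) -> dict:
    # Wrap scalar strings in a list, keep lists as-is, drop anything else.
    sanitized = {}
    for category, value in answer_json.items():
        if isinstance(value, str):
            sanitized[category] = [value]
        elif isinstance(value, list):
            sanitized[category] = value
        else:
            sanitized[category] = []
    return sanitized


print(sanitize_answer({"PERSON": "Elon Musk", "LOCATION": None}))
# {'PERSON': ['Elon Musk'], 'LOCATION': []}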

Import dataset and pre-annotations to Kili

Now that we have both the data and the pre-annotations, we can import them into a Kili project. The client below authenticates with the API key stored in your KILI_API_KEY environment variable.

from kili.client import Kili
kili = Kili(
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)

Below, we define the ontology (JSON interface) of the project, with the 4 classes and their corresponding colors:

COLORS = ["#1f77b4", "#ff7f0e", "#2ca02c", "#d62728"]

ENTITY_TYPES = [
    ("PERSON", "Person"),
    ("ORGANIZATION", "Organization"),
    ("LOCATION", "Location"),
    ("MISCELLANEOUS", "Miscellaneous"),
]

ENTITY_TYPES_WITH_COLORS = [
    (entity_type[0], entity_type[1], color) for entity_type, color in zip(ENTITY_TYPES, COLORS)
]
print(ENTITY_TYPES_WITH_COLORS)
[('PERSON', 'Person', '#1f77b4'), ('ORGANIZATION', 'Organization', '#ff7f0e'), ('LOCATION', 'Location', '#2ca02c'), ('MISCELLANEOUS', 'Miscellaneous', '#d62728')]
json_interface = {
    "jobs": {
        "NAMED_ENTITIES_RECOGNITION_JOB": {
            "mlTask": "NAMED_ENTITIES_RECOGNITION",
            "content": {
                "categories": {
                    name: {"name": name_pretty, "children": [], "color": color}
                    for name, name_pretty, color in ENTITY_TYPES_WITH_COLORS
                },
                "input": "radio",
            },
            "instruction": "",
            "required": 1,
            "isChild": False,
        }
    },
}

Let's now create the project with its ontology:

project = kili.create_project(
    title="[Kili SDK Notebook]: CoNLL Named Entity Recognition with OpenAI pre-annotations",
    input_type="TEXT",
    json_interface=json_interface,
)
project_id = project["id"]

We now import the sentences to the project:

external_id_array = []
content_array = []
for datapoint in dataset:
    sentence = datapoint["sentence"]
    content_array.append(sentence)
    external_id_array.append(datapoint["id"])

print(content_array[:3])
print(external_id_array[:3])
['EU rejects German call to boycott British lamb.', 'The European Commission said on Thursday it disagreed with German advice to consumers to shun British lamb until scientists determine whether mad cow disease can be transmitted to sheep.', "Germany's representative to the European Union's veterinary committee Werner Zwingmann said on Wednesday consumers should buy sheepmeat from countries other than Britain until the scientific advice was clearer."]
['0', '3', '4']
kili.append_many_to_dataset(
    project_id=project_id, content_array=content_array, external_id_array=external_id_array
)
{'id': 'clf14l26401or0jv4e0d7d9ge'}

If you go to the project page, you should be able to see your assets.

And on the labeling interface, you will see the sentence and the ontology.

We can finally import our OpenAI-generated pre-annotations!

json_response_array = []

for datapoint, sentence_annotations in zip(dataset, openai_answers):
    full_sentence = datapoint["sentence"]
    annotations = []  # list of annotations for the sentence
    for category, _ in ENTITY_TYPES:
        sentence_annotations_cat = sentence_annotations[category]
        for content in sentence_annotations_cat:
            begin_offset = full_sentence.find(content)
            assert (
                begin_offset != -1
            ), f"Cannot find offset of '{content}' in sentence '{full_sentence}'"
            annotation = {
                "categories": [{"name": category}],
                "beginOffset": begin_offset,
                "content": content,
            }
            annotations.append(annotation)

    json_resp = {"NAMED_ENTITIES_RECOGNITION_JOB": {"annotations": annotations}}
    json_response_array.append(json_resp)
print(json_response_array[0])
{'NAMED_ENTITIES_RECOGNITION_JOB': {'annotations': [{'categories': [{'name': 'ORGANIZATION'}], 'beginOffset': 0, 'content': 'EU'}, {'categories': [{'name': 'ORGANIZATION'}], 'beginOffset': 11, 'content': 'German'}, {'categories': [{'name': 'LOCATION'}], 'beginOffset': 34, 'content': 'British'}, {'categories': [{'name': 'MISCELLANEOUS'}], 'beginOffset': 42, 'content': 'lamb'}]}}
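
One caveat: str.find() only returns the first occurrence, so if the same entity string appears several times in a sentence, all of its annotations will point to the first match. A hypothetical helper like find_all_offsets below could collect every start offset instead:

def find_all_offsets(sentence: str, content: str) -> list:
    # Collect the start offset of every non-overlapping occurrence.
    offsets = []
    start = sentence.find(content)
    while start != -1:
        offsets.append(start)
        start = sentence.find(content, start + len(content))
    return offsets


print(find_all_offsets("British lamb, British beef", "British"))  # [0, 14]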

We then import the annotations using the kili.create_predictions() method:

kili.create_predictions(
    project_id,
    external_id_array=external_id_array,
    json_response_array=json_response_array,
    model_name=openai_query_params["model"],
)
{'id': 'clf14l26401or0jv4e0d7d9ge'}

In the main project page, you should now be able to see that the assets have been pre-annotated with the model you chose before.

On the labeling interface for a specific asset, you can see the pre-annotations.

Great! We have successfully pre-annotated our dataset. Looks like this solution has the potential to save us a lot of time in future projects.

Pre-annotations quality evaluation

Because OpenAI-generated pre-annotations are not perfect, we need a way to measure their accuracy.

Since the CoNLL2003 dataset already comes with ground-truth annotations, we can easily evaluate the quality of the pre-annotations generated by OpenAI.

def format_sentence_annotations(sentence_annotations):
    """Maps a token to its NER tag (B-ORGANIZATION, I-ORGANIZATION, etc.) class value."""
    ret = defaultdict(list)
    for category, _ in ENTITY_TYPES:
        sentence_annotations_cat = sentence_annotations[category]
        for content in sentence_annotations_cat:
            content_split = content.split(" ")
            for i, token in enumerate(content_split):
                if i == 0:
                    ret[token].append(NER_TAGS_ONTOLOGY[f"B-{category}"])
                else:
                    ret[token].append(NER_TAGS_ONTOLOGY[f"I-{category}"])
    return ret
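
To see what this mapping produces, here is an illustrative call on the annotations we got earlier for the Elon Musk test sentence:

example_annotations = {
    "PERSON": ["Elon Musk"],
    "ORGANIZATION": ["Tesla", "SpaceX", "OpenAI"],
    "LOCATION": ["South Africa", "USA"],
    "MISCELLANEOUS": [],
}
print(dict(format_sentence_annotations(example_annotations)))
# {'Elon': [1], 'Musk': [2], 'Tesla': [3], 'SpaceX': [3], 'OpenAI': [3],
#  'South': [5], 'Africa': [6], 'USA': [5]}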


references = []
predictions = []
for datapoint, sentence_annotations in zip(dataset, openai_answers):
    references.append(datapoint["ner_tags"])

    sentence_annotations = format_sentence_annotations(sentence_annotations)
    ner_tags_predicted = []
    for token in datapoint["tokens"]:
        if token in sentence_annotations and len(sentence_annotations[token]) > 0:
            ner_tags_predicted.append(sentence_annotations[token][0])
            del sentence_annotations[token][0]
        else:
            ner_tags_predicted.append(NER_TAGS_ONTOLOGY["O"])
    predictions.append(ner_tags_predicted)
print(dataset[0]["tokens"])
print(references[0])
print(predictions[0])
print(NER_TAGS_ONTOLOGY)
['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.']
[3, 0, 7, 0, 0, 0, 7, 0, 0]
[3, 0, 3, 0, 0, 0, 5, 7, 0]
{'O': 0, 'B-PERSON': 1, 'I-PERSON': 2, 'B-ORGANIZATION': 3, 'I-ORGANIZATION': 4, 'B-LOCATION': 5, 'I-LOCATION': 6, 'B-MISCELLANEOUS': 7, 'I-MISCELLANEOUS': 8}
def flatten_list(list_):
    ret = []
    for sublist in list_:
        ret.extend(sublist)
    return ret


references = flatten_list(references)
predictions = flatten_list(predictions)
references = np.array(references)
predictions = np.array(predictions)
from sklearn.metrics import f1_score

We will use the F1 score to report the results.
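
As a reminder, the F1 score is the harmonic mean of precision and recall. A quick sanity check on a toy example:

# 2 true positives, 1 false positive, 1 false negative:
# precision = 2/3, recall = 2/3, so F1 = 2/3.
print(f1_score([1, 1, 0, 1], [1, 1, 1, 0]))  # 0.666...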

table = Table(title="Results")

table.add_column("Class")
table.add_column("F1")
table.add_column("Nb samples", justify="center")

for class_name, class_value in NER_TAGS_ONTOLOGY.items():
    y_true = np.where(references == class_value, 1, 0)
    y_pred = np.where(predictions == class_value, 1, 0)
    table.add_row(
        class_name,
        f"{f1_score(y_true, y_pred) * 100:6.1f}%",
        f"{y_true.sum():3d}",
        end_section=True,
    )

# Group tokens regardless of their positions in the entities
NER_TAGS_ONTOLOGY_GROUPED = {
    "PERSON": (1, 2),
    "ORGANIZATION": (3, 4),
    "LOCATION": (5, 6),
    "MISCELLANEOUS": (7, 8),
}
for class_name, class_values in NER_TAGS_ONTOLOGY_GROUPED.items():
    y_true = np.where((references == class_values[0]) | (references == class_values[1]), 1, 0)
    y_pred = np.where((predictions == class_values[0]) | (predictions == class_values[1]), 1, 0)
    table.add_row(
        class_name,
        f"{f1_score(y_true, y_pred) * 100:6.1f}%",
        f"{y_true.sum():3d}",
        style="bold green",
        end_section=True,
    )


table.add_row(
    "All",
    f"{f1_score(references, predictions, average='weighted') * 100:6.1f}%",
    f"{len(references):3d}",
    style="bold bright_red",
)
console = Console()
console.print(table)
                 Results                  
┏━━━━━━━━━━━━━━━━━┳━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Class           ┃ F1      ┃ Nb samples ┃
┡━━━━━━━━━━━━━━━━━╇━━━━━━━━━╇━━━━━━━━━━━━┩
│ O               │   93.4% │    468     │
├─────────────────┼─────────┼────────────┤
│ B-PERSON        │   90.9% │     12     │
├─────────────────┼─────────┼────────────┤
│ I-PERSON        │  100.0% │     10     │
├─────────────────┼─────────┼────────────┤
│ B-ORGANIZATION  │   58.8% │     12     │
├─────────────────┼─────────┼────────────┤
│ I-ORGANIZATION  │   44.4% │      8     │
├─────────────────┼─────────┼────────────┤
│ B-LOCATION      │   71.4% │     14     │
├─────────────────┼─────────┼────────────┤
│ I-LOCATION      │    0.0% │      0     │
├─────────────────┼─────────┼────────────┤
│ B-MISCELLANEOUS │   10.3% │     15     │
├─────────────────┼─────────┼────────────┤
│ I-MISCELLANEOUS │   31.2% │      5     │
├─────────────────┼─────────┼────────────┤
│ PERSON          │   95.2% │     22     │
├─────────────────┼─────────┼────────────┤
│ ORGANIZATION    │   53.8% │     20     │
├─────────────────┼─────────┼────────────┤
│ LOCATION        │   71.4% │     14     │
├─────────────────┼─────────┼────────────┤
│ MISCELLANEOUS   │   19.7% │     20     │
├─────────────────┼─────────┼────────────┤
│ All             │   88.6% │    544     │
└─────────────────┴─────────┴────────────┘

Quite good!

As we can see, the pre-annotations are not perfect, but they are good enough to help us speed up the labeling process in future projects.

Conclusion

In this tutorial, we have seen how to use the OpenAI API to generate pre-annotations for a dataset. We have also seen how to import the data and the pre-annotations to a Kili project, and how to evaluate the quality of these pre-annotations.