How to Set Up a Kili LLM Static project

In this tutorial you'll learn how to create and import conversations in a Kili project with a custom interface for comparing LLM outputs.

Here are the steps we will follow:

  1. Creating a Kili project with a custom interface
  2. Importing three conversations into the project

Creating a Kili Project with a Custom Interface

We will create a Kili project with a custom interface that includes several jobs for comparing LLM outputs.

Defining Three Levels of Annotation Jobs

To streamline the annotation process, we define three distinct levels of annotation jobs:

  • Completion: This job enables annotators to evaluate individual responses generated by LLMs. Each response is annotated separately.

  • Round: This job allows annotators to assess a single round of conversation, grouping all the LLM responses within that round under a single annotation.

  • Conversation: This job facilitates annotation at the conversation level, where the entire exchange can be evaluated as a whole.

In this example, we use a JSON interface that incorporates classifications at all these levels, enabling comprehensive annotation:

interface = {
    "jobs": {
        "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
            "content": {
                "categories": {
                    "TOO_SHORT": {"children": [], "name": "Too short", "id": "category1"},
                    "JUST_RIGHT": {"children": [], "name": "Just right", "id": "category2"},
                    "TOO_VERBOSE": {"children": [], "name": "Too verbose", "id": "category3"},
                },
                "input": "radio",
            },
            "instruction": "Verbosity",
            "level": "completion",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_1": {
            "content": {
                "categories": {
                    "NO_ISSUES": {"children": [], "name": "No issues", "id": "category4"},
                    "MINOR_ISSUES": {"children": [], "name": "Minor issue(s)", "id": "category5"},
                    "MAJOR_ISSUES": {"children": [], "name": "Major issue(s)", "id": "category6"},
                },
                "input": "radio",
            },
            "instruction": "Instructions Following",
            "level": "completion",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_2": {
            "content": {
                "categories": {
                    "NO_ISSUES": {"children": [], "name": "No issues", "id": "category7"},
                    "MINOR_INACCURACY": {
                        "children": [],
                        "name": "Minor inaccuracy",
                        "id": "category8",
                    },
                    "MAJOR_INACCURACY": {
                        "children": [],
                        "name": "Major inaccuracy",
                        "id": "category9",
                    },
                },
                "input": "radio",
            },
            "instruction": "Truthfulness",
            "level": "completion",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_3": {
            "content": {
                "categories": {
                    "NO_ISSUES": {"children": [], "name": "No issues", "id": "category10"},
                    "MINOR_SAFETY_CONCERN": {
                        "children": [],
                        "name": "Minor safety concern",
                        "id": "category11",
                    },
                    "MAJOR_SAFETY_CONCERN": {
                        "children": [],
                        "name": "Major safety concern",
                        "id": "category12",
                    },
                },
                "input": "radio",
            },
            "instruction": "Harmlessness/Safety",
            "level": "completion",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "COMPARISON_JOB": {
            "content": {
                "options": {
                    "IS_MUCH_BETTER": {"children": [], "name": "Is much better", "id": "option13"},
                    "IS_BETTER": {"children": [], "name": "Is better", "id": "option14"},
                    "IS_SLIGHTLY_BETTER": {
                        "children": [],
                        "name": "Is slightly better",
                        "id": "option15",
                    },
                    "TIE": {"children": [], "name": "Tie", "mutual": True, "id": "option16"},
                },
                "input": "radio",
            },
            "instruction": "Pick the best answer",
            "mlTask": "COMPARISON",
            "required": 1,
            "isChild": False,
            "isNew": False,
        },
        "CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
            "content": {
                "categories": {
                    "BOTH_ARE_GOOD": {"children": [], "name": "Both are good", "id": "category17"},
                    "BOTH_ARE_BAD": {"children": [], "name": "Both are bad", "id": "category18"},
                },
                "input": "radio",
            },
            "instruction": "Overall quality",
            "level": "round",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
            "content": {
                "categories": {
                    "GLOBAL_GOOD": {"children": [], "name": "Globally good", "id": "category19"},
                    "BOTH_ARE_BAD": {"children": [], "name": "Globally bad", "id": "category20"},
                },
                "input": "radio",
            },
            "instruction": "Global",
            "level": "conversation",
            "mlTask": "CLASSIFICATION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
        "TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
            "content": {"input": "textField"},
            "instruction": "Additional comments...",
            "level": "conversation",
            "mlTask": "TRANSCRIPTION",
            "required": 0,
            "isChild": False,
            "isNew": False,
        },
    }
}
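Before creating the project, it can help to verify that every annotation level is covered by at least one job. The sketch below uses a hypothetical `jobs_by_level` helper (not part of the Kili SDK) on a trimmed-down jobs dict mirroring the structure above; note that COMPARISON jobs carry no "level" key in this interface, so they are grouped under `None`:

```python
from collections import defaultdict


def jobs_by_level(jobs):
    """Group job names by their annotation level (hypothetical helper)."""
    levels = defaultdict(list)
    for name, job in jobs.items():
        # Jobs without a "level" key (e.g. COMPARISON_JOB) fall under None.
        levels[job.get("level")].append(name)
    return dict(levels)


# Trimmed-down jobs dict mirroring the structure of the interface above.
sample_jobs = {
    "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {"level": "completion", "mlTask": "CLASSIFICATION"},
    "COMPARISON_JOB": {"mlTask": "COMPARISON"},
    "CLASSIFICATION_JOB_AT_ROUND_LEVEL": {"level": "round", "mlTask": "CLASSIFICATION"},
    "TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {"level": "conversation", "mlTask": "TRANSCRIPTION"},
}
print(jobs_by_level(sample_jobs))
```

Running the same helper on the full `interface["jobs"]` dict gives a quick overview of how many jobs exist at each level.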

Now, we create the project using the create_project method with the LLM_STATIC input type:

from kili.client import Kili

kili = Kili(
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
)
project = kili.create_project(
    title="[Kili SDK Notebook]: LLM Static",
    description="Project Description",
    input_type="LLM_STATIC",
    json_interface=interface,
)
project_id = project["id"]

Importing Conversations

We will import three conversations into the project. The conversations are stored in a JSON file, which we load and import using the import_conversations method.

import requests

conversations = requests.get(
    "https://storage.googleapis.com/label-public-staging/demo-projects/LLM_static/llm-conversations.json"
).json()
kili.llm.import_conversations(project_id, conversations=conversations)
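If the import fails, it is usually easier to inspect the payload first. The sketch below uses a hypothetical `validate_conversations` helper (not part of the Kili SDK) that only checks the basic shape of the payload, assuming the downloaded JSON is a list of conversation dicts; it makes no assumption about the keys inside each conversation:

```python
def validate_conversations(conversations):
    """Hypothetical pre-import check: ensure a non-empty list of dicts."""
    if not isinstance(conversations, list) or not conversations:
        raise ValueError("Expected a non-empty list of conversations")
    for i, conv in enumerate(conversations):
        if not isinstance(conv, dict):
            raise ValueError(f"Conversation at index {i} is not a dict")
    return len(conversations)


# Offline sample standing in for the downloaded JSON file.
sample = [{}, {}, {}]
print(validate_conversations(sample))  # 3
```

You could call this on the `conversations` list returned by `requests.get(...).json()` before handing it to `import_conversations`.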

You can now see the imported conversations in the UI:

Summary

In this tutorial, we've:

  • Created a Kili project with a custom interface for LLM output comparison.
  • Imported conversations using the Kili LLM format.