How to Set Up a Kili LLM Static Project
In this tutorial you'll learn how to create and import conversations in a Kili project with a custom interface for comparing LLM outputs.
Here are the steps we will follow:
- Creating a Kili project with a custom interface
- Importing three conversations into the project
Creating a Kili Project with a Custom Interface
We will create a Kili project with a custom interface that includes several jobs for comparing LLM outputs.
Defining Three Levels of Annotation Jobs
To streamline the annotation process, we define three distinct levels of annotation jobs:
- Completion: This job enables annotators to evaluate individual responses generated by LLMs. Each response is annotated separately.
- Round: This job allows annotators to assess a single round of conversation, grouping all the LLM responses within that round under a single annotation.
- Conversation: This job facilitates annotation at the conversation level, where the entire exchange can be evaluated as a whole.
In this example, we use a JSON interface that incorporates classifications at all these levels, enabling comprehensive annotation:
interface = {
"jobs": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"content": {
"categories": {
"TOO_SHORT": {"children": [], "name": "Too short", "id": "category1"},
"JUST_RIGHT": {"children": [], "name": "Just right", "id": "category2"},
"TOO_VERBOSE": {"children": [], "name": "Too verbose", "id": "category3"},
},
"input": "radio",
},
"instruction": "Verbosity",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_1": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category4"},
"MINOR_ISSUES": {"children": [], "name": "Minor issue(s)", "id": "category5"},
"MAJOR_ISSUES": {"children": [], "name": "Major issue(s)", "id": "category6"},
},
"input": "radio",
},
"instruction": "Instructions Following",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_2": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category7"},
"MINOR_INACCURACY": {
"children": [],
"name": "Minor inaccuracy",
"id": "category8",
},
"MAJOR_INACCURACY": {
"children": [],
"name": "Major inaccuracy",
"id": "category9",
},
},
"input": "radio",
},
"instruction": "Truthfulness",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_3": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category10"},
"MINOR_SAFETY_CONCERN": {
"children": [],
"name": "Minor safety concern",
"id": "category11",
},
"MAJOR_SAFETY_CONCERN": {
"children": [],
"name": "Major safety concern",
"id": "category12",
},
},
"input": "radio",
},
"instruction": "Harmlessness/Safety",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"COMPARISON_JOB": {
"content": {
"options": {
"IS_MUCH_BETTER": {"children": [], "name": "Is much better", "id": "option13"},
"IS_BETTER": {"children": [], "name": "Is better", "id": "option14"},
"IS_SLIGHTLY_BETTER": {
"children": [],
"name": "Is slightly better",
"id": "option15",
},
"TIE": {"children": [], "name": "Tie", "mutual": True, "id": "option16"},
},
"input": "radio",
},
"instruction": "Pick the best answer",
"mlTask": "COMPARISON",
"required": 1,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
"content": {
"categories": {
"BOTH_ARE_GOOD": {"children": [], "name": "Both are good", "id": "category17"},
"BOTH_ARE_BAD": {"children": [], "name": "Both are bad", "id": "category18"},
},
"input": "radio",
},
"instruction": "Overall quality",
"level": "round",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"content": {
"categories": {
"GLOBAL_GOOD": {"children": [], "name": "Globally good", "id": "category19"},
"BOTH_ARE_BAD": {"children": [], "name": "Globally bad", "id": "category20"},
},
"input": "radio",
},
"instruction": "Global",
"level": "conversation",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
"content": {"input": "textField"},
"instruction": "Additional comments...",
"level": "conversation",
"mlTask": "TRANSCRIPTION",
"required": 0,
"isChild": False,
"isNew": False,
},
}
}
Now, we create the project using the create_project method, with type LLM_STATIC:
from kili.client import Kili
kili = Kili(
# api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
)
project = kili.create_project(
title="[Kili SDK Notebook]: LLM Static",
description="Project Description",
input_type="LLM_STATIC",
json_interface=interface,
)
project_id = project["id"]
Import conversations
When working with LLM Static projects in Kili, each conversation represents a single data item (or asset). You must define:
- A unique identifier for the conversation (externalId).
- An array of messages (or chat items) under the key chatItems.
Each message in chatItems requires:
- The content of the message (content).
- A unique identifier for the message (externalId).
- A role, which can be SYSTEM, USER, or ASSISTANT.
- If the role is ASSISTANT, a modelName is also required, indicating which LLM produced the response.
Important: Your conversation must contain exactly two ASSISTANT entries per round (i.e., two LLM responses to the same user prompt).
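These structural rules can be checked before import. Below is a minimal validation sketch; the function name is illustrative, and the checks simply encode the requirements listed above (required keys, valid roles, modelName on ASSISTANT messages, exactly two ASSISTANT entries per round, and unique externalIds):

```python
from collections import Counter

def validate_chat_items(chat_items):
    """Check the chatItems structural rules before importing a conversation."""
    allowed_roles = {"SYSTEM", "USER", "ASSISTANT"}
    assistants_in_round = 0
    for item in chat_items:
        # Every message needs content, externalId, and a valid role.
        for key in ("content", "externalId", "role"):
            if key not in item:
                raise ValueError(f"missing {key!r} in {item.get('externalId', '?')}")
        if item["role"] not in allowed_roles:
            raise ValueError(f"invalid role {item['role']!r}")
        if item["role"] == "ASSISTANT":
            if "modelName" not in item:
                raise ValueError("ASSISTANT messages require a modelName")
            assistants_in_round += 1
        elif item["role"] == "USER":
            # A new USER prompt closes the previous round.
            if assistants_in_round not in (0, 2):
                raise ValueError("each round needs exactly two ASSISTANT entries")
            assistants_in_round = 0
    if assistants_in_round != 2:
        raise ValueError("each round needs exactly two ASSISTANT entries")
    # All externalIds must be unique within the conversation.
    dupes = [k for k, n in Counter(i["externalId"] for i in chat_items).items() if n > 1]
    if dupes:
        raise ValueError(f"duplicate externalIds: {dupes}")
```

Running this on each conversation before calling the import catches malformed payloads early, with a clearer error than a failed API call.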
Example: Unlabeled Conversation
Below is a sample 3-round conversation with two different LLMs (gpt-o1 and deepseek-r1). Each round contains a USER query followed by ASSISTANT responses from each model.
"chatItems": [
{
"externalId": "system-1",
"role": "SYSTEM",
"content": "You are an assistant helping the user by providing clear and concise answers."
},
{
"externalId": "user-1",
"role": "USER",
"content": "What is the capital of France?"
},
{
"externalId": "assistant-1a",
"role": "ASSISTANT",
"content": "The capital of France is Paris.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-1b",
"role": "ASSISTANT",
"content": "The capital of France is Paris.",
"modelName": "deepseek-r1"
},
{
"externalId": "user-2",
"role": "USER",
"content": "What is Paris famous for?"
},
{
"externalId": "assistant-2a",
"role": "ASSISTANT",
"content": "Paris is renowned for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also celebrated for its rich history, vibrant art scene, world-class cuisine, and status as a global center of fashion and romance.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-2b",
"role": "ASSISTANT",
"content": "Paris is well-known for its art, fashion, and gastronomy, alongside iconic sites like the Eiffel Tower.",
"modelName": "deepseek-r1"
},
{
"externalId": "user-3",
"role": "USER",
"content": "What is the best time to visit Paris?"
},
{
"externalId": "assistant-3a",
"role": "ASSISTANT",
"content": "The best time to visit Paris is in spring (April to June) or fall (September to October), when the weather is mild and crowds are smaller.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-3b",
"role": "ASSISTANT",
"content": "Spring and autumn are ideal for visiting Paris, as the weather is pleasant, and you can avoid peak tourist season.",
"modelName": "deepseek-r1"
}
]
You can load and import such a conversation into your Kili LLM Static project by passing this data to the import_conversations method of the Kili Python SDK.
kili.llm.import_conversations(project_id, conversations=conversations)
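The conversations argument is a list of conversation dictionaries. A minimal sketch of assembling the payload follows; the externalIds are illustrative, and the final call assumes an authenticated Kili client as created earlier:

```python
# Build the payload for import_conversations; externalIds are illustrative.
chat_items = [
    {"externalId": "user-1", "role": "USER",
     "content": "What is the capital of France?"},
    {"externalId": "assistant-1a", "role": "ASSISTANT",
     "content": "The capital of France is Paris.", "modelName": "gpt-o1"},
    {"externalId": "assistant-1b", "role": "ASSISTANT",
     "content": "The capital of France is Paris.", "modelName": "deepseek-r1"},
]
conversations = [{"externalId": "conversation-1", "chatItems": chat_items}]

# With an authenticated Kili client:
# kili.llm.import_conversations(project_id, conversations=conversations)
```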
Importing labeled conversations
If you already have annotations for your conversations, you can include them in your import payload by adding the label and labeler fields.
Labeler
labeler: the email of the user who created the annotations. Note that this user must belong to your Kili organization and have access to the project.
Label Structure
The label object contains up to three optional keys—completion, round, and conversation—corresponding to the three granularity levels of labeling in an LLM Static project.
{
"label": {
"completion": { ... },
"round": { ... },
"conversation": { ... },
}
}
1. Conversation-Level Labels
Use the conversation field to set annotations that apply to the entire conversation. For each classification job, provide a categories array; for transcription jobs, provide a text property.
"conversation": {
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"categories": [
"GLOBAL_GOOD"
]
},
"TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
"text": "Great conversation!"
}
}
💡 Tip: For transcription jobs, if the job has been configured with "content": {"input": "markdown"}, then the transcription can be edited in Markdown format.
2. Round-Level Labels
Use the round field for jobs defined with level="round".
- Key: The round number (starting at 0)
- Value: The label details for that round
For comparison jobs (where mlTask is COMPARISON), you must additionally specify firstId and secondId, the externalIds of the assistant messages being compared.
"round": {
"CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
"0": {
"categories": [
"BOTH_ARE_GOOD"
]
},
"1": {
"categories": [
"BOTH_ARE_BAD"
]
},
"2": {
"categories": [
"BOTH_ARE_GOOD"
]
}
},
"COMPARISON_JOB": {
"0": {
"code": "IS_MUCH_BETTER",
"firstId": "assistant-1a",
"secondId": "assistant-1b"
},
"1": {
"code": "IS_BETTER",
"firstId": "assistant-2a",
"secondId": "assistant-2b"
},
"2": {
"code": "IS_MUCH_BETTER",
"firstId": "assistant-3b",
"secondId": "assistant-3a"
}
}
}
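Because each round must contain exactly two ASSISTANT messages, the round keys and the firstId/secondId pairs can be derived from the chat items themselves. A minimal sketch follows; the helper names and the TIE placeholder choice are illustrative:

```python
def comparison_pairs(chat_items):
    """Return (round_index, firstId, secondId) for the two ASSISTANT
    messages of each round, in order of appearance."""
    pairs, current = [], []
    for item in chat_items:
        if item["role"] == "ASSISTANT":
            current.append(item["externalId"])
            if len(current) == 2:
                pairs.append((len(pairs), current[0], current[1]))
                current = []
    return pairs

def build_comparison_job(chat_items, default_code="TIE"):
    """Build a COMPARISON_JOB payload with a placeholder choice per round."""
    return {
        str(i): {"code": default_code, "firstId": first, "secondId": second}
        for i, first, second in comparison_pairs(chat_items)
    }
```

Annotators (or your own scoring logic) would then overwrite the placeholder code per round with the actual comparison choice.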
3. Completion-Level Labels
Use the completion field for jobs defined at the completion level (i.e., one label per assistant response).
- Key: The externalId of the assistant message
- Value: The label details for that specific completion
"completion": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"assistant-1a": {
"categories": [
"TOO_SHORT"
]
},
"assistant-2a": {
"categories": [
"JUST_RIGHT"
]
},
"assistant-3b": {
"categories": [
"TOO_SHORT"
]
}
}
}
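Since the keys of this mapping are assistant externalIds, it is straightforward to build programmatically from any per-response scoring you already have. A minimal sketch, where the category choices are illustrative:

```python
# Map assistant externalIds to verbosity categories (choices illustrative).
verbosity_choices = {
    "assistant-1a": "TOO_SHORT",
    "assistant-2a": "JUST_RIGHT",
    "assistant-3b": "TOO_SHORT",
}
completion = {
    "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
        external_id: {"categories": [category]}
        for external_id, category in verbosity_choices.items()
    }
}
```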
Full Example: Labeled Conversation
Here is a condensed example illustrating conversation, round, and completion labels. Notice each labeled conversation includes both label and labeler.
{
"externalId": "conversation-42",
"labeler": "annotator@example.com",
"label": {
"conversation": {
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"categories": [
"GLOBAL_GOOD"
]
}
},
"round": {
"COMPARISON_JOB": {
"0": {
"code": "IS_BETTER",
"firstId": "assistant-1a",
"secondId": "assistant-1b"
}
}
},
"completion": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"assistant-1a": {
"categories": [
"TOO_SHORT"
]
}
}
}
},
"chatItems": [
{
"externalId": "system-1",
"role": "SYSTEM",
"content": "System instructions."
},
{
"externalId": "user-1",
"role": "USER",
"content": "What's the capital of France?"
},
{
"externalId": "assistant-1a",
"role": "ASSISTANT",
"content": "Paris.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-1b",
"role": "ASSISTANT",
"content": "The capital is Paris, France.",
"modelName": "deepseek-r1"
}
]
}
Final Notes
- Ensure each conversation and chat item has a unique externalId.
- When including labeled data, the specified labeler must be a valid user in your Kili organization with access to the project.
- Make sure your labeling schema aligns with the job definitions in your project's JSON interface.
If you follow these guidelines, you’ll be able to seamlessly import both unlabeled and labeled conversations into your Kili LLM Static project.