How to Set Up a Kili LLM Static Project
In this tutorial you'll learn how to create and import conversations in a Kili project with a custom interface for comparing LLM outputs.
Here are the steps we will follow:
- Creating a Kili project with a custom interface
- Importing three conversations into the project
Creating a Kili Project with a Custom Interface
We will create a Kili project with a custom interface that includes several jobs for comparing LLM outputs.
Defining Three Levels of Annotation Jobs
To streamline the annotation process, we define three distinct levels of annotation jobs:
- Completion: This job enables annotators to evaluate individual responses generated by LLMs. Each response is annotated separately.
- Round: This job allows annotators to assess a single round of conversation, grouping all the LLM responses within that round under a single annotation.
- Conversation: This job facilitates annotation at the conversation level, where the entire exchange can be evaluated as a whole.
In this example, we use a JSON interface that incorporates classifications at all these levels, enabling comprehensive annotation:
interface = {
"jobs": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"content": {
"categories": {
"TOO_SHORT": {"children": [], "name": "Too short", "id": "category1"},
"JUST_RIGHT": {"children": [], "name": "Just right", "id": "category2"},
"TOO_VERBOSE": {"children": [], "name": "Too verbose", "id": "category3"},
},
"input": "radio",
},
"instruction": "Verbosity",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_1": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category4"},
"MINOR_ISSUES": {"children": [], "name": "Minor issue(s)", "id": "category5"},
"MAJOR_ISSUES": {"children": [], "name": "Major issue(s)", "id": "category6"},
},
"input": "radio",
},
"instruction": "Instructions Following",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_2": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category7"},
"MINOR_INACCURACY": {
"children": [],
"name": "Minor inaccuracy",
"id": "category8",
},
"MAJOR_INACCURACY": {
"children": [],
"name": "Major inaccuracy",
"id": "category9",
},
},
"input": "radio",
},
"instruction": "Truthfulness",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL_3": {
"content": {
"categories": {
"NO_ISSUES": {"children": [], "name": "No issues", "id": "category10"},
"MINOR_SAFETY_CONCERN": {
"children": [],
"name": "Minor safety concern",
"id": "category11",
},
"MAJOR_SAFETY_CONCERN": {
"children": [],
"name": "Major safety concern",
"id": "category12",
},
},
"input": "radio",
},
"instruction": "Harmlessness/Safety",
"level": "completion",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"COMPARISON_JOB": {
"content": {
"options": {
"IS_MUCH_BETTER": {"children": [], "name": "Is much better", "id": "option13"},
"IS_BETTER": {"children": [], "name": "Is better", "id": "option14"},
"IS_SLIGHTLY_BETTER": {
"children": [],
"name": "Is slightly better",
"id": "option15",
},
"TIE": {"children": [], "name": "Tie", "mutual": True, "id": "option16"},
},
"input": "radio",
},
"instruction": "Pick the best answer",
"mlTask": "COMPARISON",
"required": 1,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
"content": {
"categories": {
"BOTH_ARE_GOOD": {"children": [], "name": "Both are good", "id": "category17"},
"BOTH_ARE_BAD": {"children": [], "name": "Both are bad", "id": "category18"},
},
"input": "radio",
},
"instruction": "Overall quality",
"level": "round",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"content": {
"categories": {
"GLOBAL_GOOD": {"children": [], "name": "Globally good", "id": "category19"},
"BOTH_ARE_BAD": {"children": [], "name": "Globally bad", "id": "category20"},
},
"input": "radio",
},
"instruction": "Global",
"level": "conversation",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
"isNew": False,
},
"TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
"content": {"input": "textField"},
"instruction": "Additional comments...",
"level": "conversation",
"mlTask": "TRANSCRIPTION",
"required": 0,
"isChild": False,
"isNew": False,
},
}
}
Now, we create the project using the create_project method, with type LLM_STATIC:
from kili.client import Kili
kili = Kili(
# api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
)
project = kili.create_project(
title="[Kili SDK Notebook]: LLM Static",
description="Project Description",
input_type="LLM_STATIC",
json_interface=interface,
)
project_id = project["id"]
Import conversations
When working with LLM Static projects in Kili, each conversation represents a single data item (or asset). You must define:
- A unique identifier for the conversation (externalId).
- An array of messages (or chat items) under the key chatItems.
Each message in chatItems requires:
- The content of the message (content).
- A unique identifier for the message (externalId).
- A role, which can be SYSTEM, USER, or ASSISTANT.
- If the role is ASSISTANT, a modelName is also required, indicating which LLM produced the response.
Important: Your conversation must contain exactly two ASSISTANT entries per round (i.e., two LLM responses to the same user prompt).
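These structural rules can be checked before import. Below is a minimal validation sketch; the function name is illustrative, and the checks simply encode the requirements listed above (required keys, valid roles, modelName on ASSISTANT messages, exactly two ASSISTANT entries per round, and unique externalIds):

```python
from collections import Counter

def validate_chat_items(chat_items):
    """Check the chatItems structural rules before importing a conversation."""
    allowed_roles = {"SYSTEM", "USER", "ASSISTANT"}
    assistants_in_round = 0
    for item in chat_items:
        # Every message needs content, externalId, and a valid role.
        for key in ("content", "externalId", "role"):
            if key not in item:
                raise ValueError(f"missing {key!r} in {item.get('externalId', '?')}")
        if item["role"] not in allowed_roles:
            raise ValueError(f"invalid role {item['role']!r}")
        if item["role"] == "ASSISTANT":
            if "modelName" not in item:
                raise ValueError("ASSISTANT messages require a modelName")
            assistants_in_round += 1
        elif item["role"] == "USER":
            # A new USER prompt closes the previous round.
            if assistants_in_round not in (0, 2):
                raise ValueError("each round needs exactly two ASSISTANT entries")
            assistants_in_round = 0
    if assistants_in_round != 2:
        raise ValueError("each round needs exactly two ASSISTANT entries")
    # All externalIds must be unique within the conversation.
    dupes = [k for k, n in Counter(i["externalId"] for i in chat_items).items() if n > 1]
    if dupes:
        raise ValueError(f"duplicate externalIds: {dupes}")
```

Running this on each conversation before calling the import catches malformed payloads early, with a clearer error than a failed API call.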
Example: Unlabeled Conversation
Below is a sample 3-round conversation with two different LLMs (gpt-o1 and deepseek-r1). Each round contains a USER query followed by ASSISTANT responses from each model.
"chatItems": [
{
"externalId": "system-1",
"role": "SYSTEM",
"content": "You are an assistant helping the user by providing clear and concise answers."
},
{
"externalId": "user-1",
"role": "USER",
"content": "What is the capital of France?"
},
{
"externalId": "assistant-1a",
"role": "ASSISTANT",
"content": "The capital of France is Paris.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-1b",
"role": "ASSISTANT",
"content": "The capital of France is Paris.",
"modelName": "deepseek-r1"
},
{
"externalId": "user-2",
"role": "USER",
"content": "What is Paris famous for?"
},
{
"externalId": "assistant-2a",
"role": "ASSISTANT",
"content": "Paris is renowned for its iconic landmarks, such as the Eiffel Tower, Louvre Museum, and Notre-Dame Cathedral. The city is also celebrated for its rich history, vibrant art scene, world-class cuisine, and status as a global center of fashion and romance.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-2b",
"role": "ASSISTANT",
"content": "Paris is well-known for its art, fashion, and gastronomy, alongside iconic sites like the Eiffel Tower.",
"modelName": "deepseek-r1"
},
{
"externalId": "user-3",
"role": "USER",
"content": "What is the best time to visit Paris?"
},
{
"externalId": "assistant-3a",
"role": "ASSISTANT",
"content": "The best time to visit Paris is in spring (April to June) or fall (September to October), when the weather is mild and crowds are smaller.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-3b",
"role": "ASSISTANT",
"content": "Spring and autumn are ideal for visiting Paris, as the weather is pleasant, and you can avoid peak tourist season.",
"modelName": "deepseek-r1"
}
]
You can load and import such a conversation into your Kili LLM Static project by passing this data to the import_conversations method of the Kili Python SDK.
kili.llm.import_conversations(project_id, conversations=conversations)
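The conversations argument is a list of conversation dictionaries. A minimal sketch of assembling the payload follows; the externalIds are illustrative, and the final call assumes an authenticated Kili client as created earlier:

```python
# Build the payload for import_conversations; externalIds are illustrative.
chat_items = [
    {"externalId": "user-1", "role": "USER",
     "content": "What is the capital of France?"},
    {"externalId": "assistant-1a", "role": "ASSISTANT",
     "content": "The capital of France is Paris.", "modelName": "gpt-o1"},
    {"externalId": "assistant-1b", "role": "ASSISTANT",
     "content": "The capital of France is Paris.", "modelName": "deepseek-r1"},
]
conversations = [{"externalId": "conversation-1", "chatItems": chat_items}]

# With an authenticated Kili client:
# kili.llm.import_conversations(project_id, conversations=conversations)
```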
Importing labeled conversations
If you already have annotations for your conversations, you can include them in your import payload by adding the label and labeler fields.
Labeler
labeler: the email of the user who created the annotations. Note that this user must belong to your Kili organization and have access to the project.
Label Structure
The label object contains up to three optional keys—completion, round, and conversation—corresponding to the three granularity levels of labeling in an LLM Static project.
{
"label": {
"completion": { ... },
"round": { ... },
"conversation": { ... },
}
}
1. Conversation-Level Labels
Use the conversation field to set annotations that apply to the entire conversation. For each classification job, provide a categories array; for transcription jobs, provide a text property.
"conversation": {
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"categories": [
"GLOBAL_GOOD"
]
},
"TRANSCRIPTION_JOB_AT_CONVERSATION_LEVEL": {
"text": "Great conversation!"
}
}
💡 Tip: For transcription jobs, if the job has been configured with "content": {"input": "markdown"}, then the transcription can be edited in Markdown format.
2. Round-Level Labels
Use the round field for jobs defined with level="round".
- Key: The round number (starting at 0)
- Value: The label details for that round
For comparison jobs (where mlTask is COMPARISON), you must additionally specify firstId and secondId, the externalIds of the assistant messages being compared.
"round": {
"CLASSIFICATION_JOB_AT_ROUND_LEVEL": {
"0": {
"categories": [
"BOTH_ARE_GOOD"
]
},
"1": {
"categories": [
"BOTH_ARE_BAD"
]
},
"2": {
"categories": [
"BOTH_ARE_GOOD"
]
}
},
"COMPARISON_JOB": {
"0": {
"code": "IS_MUCH_BETTER",
"firstId": "assistant-1a",
"secondId": "assistant-1b"
},
"1": {
"code": "IS_BETTER",
"firstId": "assistant-2a",
"secondId": "assistant-2b"
},
"2": {
"code": "IS_MUCH_BETTER",
"firstId": "assistant-3b",
"secondId": "assistant-3a"
}
}
}
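Because each round must contain exactly two ASSISTANT messages, the round keys and the firstId/secondId pairs can be derived from the chat items themselves. A minimal sketch follows; the helper names and the TIE placeholder choice are illustrative:

```python
def comparison_pairs(chat_items):
    """Return (round_index, firstId, secondId) for the two ASSISTANT
    messages of each round, in order of appearance."""
    pairs, current = [], []
    for item in chat_items:
        if item["role"] == "ASSISTANT":
            current.append(item["externalId"])
            if len(current) == 2:
                pairs.append((len(pairs), current[0], current[1]))
                current = []
    return pairs

def build_comparison_job(chat_items, default_code="TIE"):
    """Build a COMPARISON_JOB payload with a placeholder choice per round."""
    return {
        str(i): {"code": default_code, "firstId": first, "secondId": second}
        for i, first, second in comparison_pairs(chat_items)
    }
```

Annotators (or your own scoring logic) would then overwrite the placeholder code per round with the actual comparison choice.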
3. Completion-Level Labels
Use the completion field for jobs defined at the completion level (i.e., one label per assistant response).
- Key: The externalId of the assistant message
- Value: The label details for that specific completion
"completion": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"assistant-1a": {
"categories": [
"TOO_SHORT"
]
},
"assistant-2a": {
"categories": [
"JUST_RIGHT"
]
},
"assistant-3b": {
"categories": [
"TOO_SHORT"
]
}
}
}
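Since the keys of this mapping are assistant externalIds, it is straightforward to build programmatically from any per-response scoring you already have. A minimal sketch, where the category choices are illustrative:

```python
# Map assistant externalIds to verbosity categories (choices illustrative).
verbosity_choices = {
    "assistant-1a": "TOO_SHORT",
    "assistant-2a": "JUST_RIGHT",
    "assistant-3b": "TOO_SHORT",
}
completion = {
    "CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
        external_id: {"categories": [category]}
        for external_id, category in verbosity_choices.items()
    }
}
```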
Full Example: Labeled Conversation
Here is a condensed example illustrating conversation, round, and completion labels. Notice each labeled conversation includes both label and labeler.
{
"externalId": "conversation-42",
"labeler": "annotator@example.com",
"label": {
"conversation": {
"CLASSIFICATION_JOB_AT_CONVERSATION_LEVEL": {
"categories": [
"GLOBAL_GOOD"
]
}
},
"round": {
"COMPARISON_JOB": {
"0": {
"code": "IS_BETTER",
"firstId": "assistant-1a",
"secondId": "assistant-1b"
}
}
},
"completion": {
"CLASSIFICATION_JOB_AT_COMPLETION_LEVEL": {
"assistant-1a": {
"categories": [
"TOO_SHORT"
]
}
}
}
},
"chatItems": [
{
"externalId": "system-1",
"role": "SYSTEM",
"content": "System instructions."
},
{
"externalId": "user-1",
"role": "USER",
"content": "What's the capital of France?"
},
{
"externalId": "assistant-1a",
"role": "ASSISTANT",
"content": "Paris.",
"modelName": "gpt-o1"
},
{
"externalId": "assistant-1b",
"role": "ASSISTANT",
"content": "The capital is Paris, France.",
"modelName": "deepseek-r1"
}
]
}
Final Notes
- Ensure each conversation and chat item has a unique externalId.
- When including labeled data, the specified labeler must be a valid user in your Kili organization with access to the project.
- Make sure your labeling schema aligns with the job definitions in your project's JSON interface.
If you follow these guidelines, you’ll be able to seamlessly import both unlabeled and labeled conversations into your Kili LLM Static project.