How to import rich-text assets
When dealing with textual data, style can convey a lot of meaning. If you annotate a long list or a legal text, displaying structured text instead of plain text lets your annotators quickly grasp patterns within the document.
Setup
%pip install kili
import os
from random import random
from kili.client import Kili
kili = Kili(
# api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
# the line above can be uncommented and changed if you are working with an on-premise version of Kili
)
Discover rich-text Kili format
Let's create a named-entity recognition project that aims at labeling American legal documents.
We start by defining the ontology of our project, with several categories:
INSTRUCTIONS
PREAMBLE
RIGHTS
REFERENCE_TO_GOD
json_interface = {
"jobs": {
"JOB_0": {
"mlTask": "NAMED_ENTITIES_RECOGNITION",
"instruction": "Categories",
"required": 1,
"isChild": False,
"isVisible": True,
"content": {
"categories": {
"INSTRUCTIONS": {"name": "Instructions", "children": [], "color": "#cc4125"},
"PREAMBLE": {"name": "Preamble", "children": [], "color": "#ffd966"},
"RIGHTS": {"name": "Rights", "children": [], "color": "#76a5af"},
"REFERENCE_TO_GOD": {
"name": "Reference to God",
"children": [],
"color": "#c27ba0",
},
},
"input": "radio",
},
}
}
}
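When an ontology grows, writing each category by hand becomes repetitive. As a minimal sketch, the "categories" dict above could be generated from a list of tuples. The `build_categories` helper is hypothetical (it is not part of the Kili SDK) and only reproduces the structure shown above.

```python
def build_categories(entries):
    """Build a categories dict from (key, display name, hex color) tuples.

    Hypothetical helper: it mirrors the structure of the "categories"
    block used in json_interface, nothing more.
    """
    return {
        key: {"name": name, "children": [], "color": color}
        for key, name, color in entries
    }


categories = build_categories(
    [
        ("INSTRUCTIONS", "Instructions", "#cc4125"),
        ("PREAMBLE", "Preamble", "#ffd966"),
        ("RIGHTS", "Rights", "#76a5af"),
        ("REFERENCE_TO_GOD", "Reference to God", "#c27ba0"),
    ]
)
print(categories["RIGHTS"])
```

The resulting dict can then be placed under `"content": {"categories": ...}` in the interface.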
Next, we create a project:
project_id = kili.create_project(
json_interface=json_interface,
input_type="TEXT",
title="[Kili SDK Notebook]: Import text assets",
)["id"]
To add assets to your Kili project, use the kili.append_many_to_dataset() method.
For rich text, you need to use the json_content_array argument:
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content_1, json_content_2],  # one json_content per asset
)
Note that the content_array parameter of this method is used for uploading standard text assets.
In the examples that follow, a json_content object is a Python list of nodes, where each node is a dict.
Nodes can be either element nodes or text nodes.
- An element node can have children (a list of other element nodes or text nodes).
- By default, element nodes are <div />.
- Possible types for an element node are:
blockquote
h1
h2
h3
h4
li
ol
p
table
tbody
td
thead
tr
ul
- Available styles for a node (see the Mozilla reference to learn more about CSS):
alignItems
alignSelf
background
backgroundColor
border
borderBottom
borderLeft
borderRadius
borderRight
borderTop
color
display
flexDirection
float
fontWeight
height
margin
marginBottom
marginLeft
marginRight
marginTop
maxHeight
maxWidth
minHeight
minWidth
padding
paddingBottom
paddingLeft
paddingRight
paddingTop
textAlign
textDecoration
textIndent
width
- A text node can contain text.
- By default, text nodes are <span />.
- Text nodes are identified by an id. The ID must be unique across the whole document. This allows an entity to overlap two or more text nodes.
- Available styles for a text node:
bold: true
code: true
italic: true
underline: true
display
float
fontWeight
margin
marginBottom
marginLeft
marginRight
marginTop
padding
paddingBottom
paddingLeft
paddingRight
paddingTop
textAlign
textDecoration
textIndent
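Since every text node id must be unique across the document, it can help to check a document before uploading it. The sketch below assumes a json_content object is a list of nested dicts, where text nodes carry a "text" key and element nodes carry a "children" list. The helper names are hypothetical and not part of the Kili SDK.

```python
from collections import Counter


def iter_text_nodes(nodes):
    """Yield every text node (a dict with a "text" key), depth first."""
    for node in nodes:
        if "text" in node:
            yield node
        else:
            yield from iter_text_nodes(node.get("children", []))


def find_duplicate_ids(json_content):
    """Return the sorted list of ids used by more than one text node."""
    counts = Counter(
        str(node["id"]) for node in iter_text_nodes(json_content) if "id" in node
    )
    return sorted(node_id for node_id, count in counts.items() if count > 1)


# Small illustrative document: the id "title" is used twice on purpose.
sample = [
    {
        "children": [
            {"id": "title", "text": "Declaration of Independence"},
            {
                "type": "p",
                "children": [
                    {"id": "title", "text": "duplicated id"},
                    {"id": "body", "text": "unique id"},
                ],
            },
        ]
    }
]
duplicates = find_duplicate_ids(sample)
print(duplicates)  # ['title']
```

An empty list means that no id is repeated.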
Let's see some real examples with the corresponding result in Kili's interface!
We can start with a really simple example:
json_content = [
{
"children": [
{
"id": f"{random()}", # set your own id here
"text": "The unanimous Declaration of the thirteen United States of America.",
}
]
}
]
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content],
)
You should be able to see the new asset in the Kili project:
Let's try to upload the asset again, but this time with some style:
json_content = [
{
"children": [
{
"id": f"{random()}",
"bold": True,
"underline": True,
"text": "The unanimous Declaration of the thirteen United States of America.",
}
]
}
]
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content],
)
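The examples above build ids with random(), which is convenient for a demo. As one possible alternative, the sketch below uses the standard uuid module to generate ids and wraps node creation in two small helpers. `text_node` and `element_node` are hypothetical names, not Kili SDK functions, and the node structure is assumed from the examples above.

```python
import uuid


def text_node(text, **styles):
    """Build a text node with a fresh UUID-based id and optional styles."""
    return {"id": uuid.uuid4().hex, "text": text, **styles}


def element_node(children, node_type=None, **styles):
    """Build an element node; the "type" key is omitted when node_type is None."""
    node = {**styles, "children": list(children)}
    if node_type is not None:
        node["type"] = node_type
    return node


title = text_node(
    "The unanimous Declaration of the thirteen United States of America.",
    bold=True,
    underline=True,
)
json_content_sketch = [element_node([title])]
print(json_content_sketch)
```

The resulting list has the same shape as the hand-written json_content above, so it could be passed in json_content_array in the same way.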
Let's try a more complex example that mixes styled text nodes with plain-text nodes:
json_content = [
{
"children": [
{
"type": "p",
"children": [
{
"id": f"{random()}",
"bold": True,
"underline": True,
"text": "The unanimous Declaration",
},
{
"id": f"{random()}",
"bold": True,
"text": " of the thirteen United States of America.",
},
{
"id": f"{random()}",
"text": (
"When in the Course of human events, it becomes necessary for one"
" people to dissolve the political bands which have connected them with"
" another, and to assume among the powers of the earth, the separate"
" and equal station to which the Laws of Nature and of Nature's God"
" entitle them, a decent respect to the opinions of mankind requires"
" that they should declare the causes which impel them to the"
" separation."
),
},
],
}
]
}
]
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content],
)
Let's create a new asset with a title, a subtitle and proper margins:
json_content = [
{
"children": [
{
"type": "h1",
"children": [
{
"border": "1px solid black",
"textAlign": "center",
"children": [{"id": f"{random()}", "text": "Declaration of Independence"}],
},
],
},
{
"type": "h2",
"children": [{"id": f"{random()}", "text": "In Congress, July 4, 1776"}],
},
{
"type": "p",
"children": [
{
"id": f"{random()}",
"bold": True,
"underline": True,
"text": "The unanimous Declaration",
},
{
"id": f"{random()}",
"bold": True,
"text": " of the thirteen United States of America.",
},
{
"id": f"{random()}",
"text": (
"When in the Course of human events, it becomes necessary for one"
" people to dissolve the political bands which have connected them with"
" another, and to assume among the powers of the earth, the separate"
" and equal station to which the Laws of Nature and of Nature's God"
" entitle them, a decent respect to the opinions of mankind requires"
" that they should declare the causes which impel them to the"
" separation."
),
},
],
},
{
"type": "p",
"marginLeft": "30px",
"marginRight": "30px",
"border": "red",
"children": [
{
"id": f"{random()}",
"text": (
"We hold these truths to be self-evident, that all men are created"
" equal, that they are endowed by their Creator with certain"
" unalienable Rights, that among these are Life, Liberty and the"
" pursuit of Happiness."
),
},
{
"type": "ul",
"children": [
{
"type": "li",
"children": [
{
"id": f"{random()}",
"text": (
"That to secure these rights, Governments are"
" instituted among Men, deriving their just powers from"
" the consent of the governed,"
),
},
],
},
{
"type": "li",
"children": [
{
"id": f"{random()}",
"text": (
"That whenever any Form of Government becomes"
" destructive of these ends, it is the Right of the"
" People to alter or to abolish it, and to institute"
" new Government, laying its foundation on such"
" principles and organizing its powers in such form, as"
" to them shall seem most likely to effect their Safety"
" and Happiness."
),
},
],
},
],
},
{
"id": f"{random()}",
"text": (
"Prudence, indeed, will dictate that Governments long established"
" should not be changed for light and transient causes; and accordingly"
" all experience hath shewn, that mankind are more disposed to suffer,"
" while evils are sufferable, than to right themselves by abolishing"
" the forms to which they are accustomed. But when a long train of"
" abuses and usurpations, pursuing invariably the same Object evinces a"
" design to reduce them under absolute Despotism, it is their right, it"
" is their duty, to throw off such Government, and to provide new"
" Guards for their future security.--Such has been the patient"
" sufferance of these Colonies; and such is now the necessity which"
" constrains them to alter their former Systems of Government. The"
" history of the present King of Great Britain is a history of repeated"
" injuries and usurpations, all having in direct object the"
" establishment of an absolute Tyranny over these States. To prove"
" this, let Facts be submitted to a candid world."
),
},
],
},
],
},
]
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content],
)
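With deeply nested documents like the one above, a typo in a "type" value is easy to miss. The following sketch walks a json_content list and reports element types that are not in the list of possible types given earlier. It assumes that list is exhaustive. `find_unknown_types` is a hypothetical helper, not a Kili SDK function.

```python
# Element types copied from the list given earlier in this tutorial.
ALLOWED_ELEMENT_TYPES = {
    "blockquote", "h1", "h2", "h3", "h4", "li", "ol",
    "p", "table", "tbody", "td", "thead", "tr", "ul",
}


def find_unknown_types(nodes, allowed=ALLOWED_ELEMENT_TYPES):
    """Return the set of "type" values found in nodes that are not in allowed."""
    unknown = set()
    for node in nodes:
        node_type = node.get("type")
        if node_type is not None and node_type not in allowed:
            unknown.add(node_type)
        # Text nodes have no "children", so the recursion stops there.
        unknown |= find_unknown_types(node.get("children", []), allowed)
    return unknown


# Illustrative document: "h5" is not in the list of possible types.
sample = [
    {
        "children": [
            {"type": "h1", "children": [{"id": "1", "text": "Title"}]},
            {"type": "h5", "children": [{"id": "2", "text": "Minor heading"}]},
        ]
    }
]
unknown_types = find_unknown_types(sample)
print(unknown_types)  # {'h5'}
```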
Convert HTML to rich-text Kili format
To make the whole process as seamless as possible, let's use simple Python code to transform raw HTML snippets directly into the Kili format.
For this purpose, we will use BeautifulSoup 4 and convert a simple table taken from the Mozilla documentation:
%pip install beautifulsoup4
from bs4 import BeautifulSoup
def from_html_to_kili(html: BeautifulSoup):
    """Recursively convert a BeautifulSoup tag into a Kili rich-text node."""
    # Default styles applied according to the HTML tag name
    styles = {
        "table": {"border": "1px solid #333"},
        "td": {"border": "1px solid #333"},
        "th": {"backgroundColor": "#333", "color": "#fff"},
    }
    if html is None:
        return html
    attributes = dict(styles.get(html.name, {}))
    # Convert direct child tags only; deeper levels are handled by the recursion
    children = [from_html_to_kili(child) for child in html.find_all(recursive=False)]
    if not children:
        # A tag without child tags becomes an element holding a single text node
        children.append({"id": f"{random()}", "text": html.text.strip()})
    attributes["children"] = children
    if html.name != "[document]":
        attributes["type"] = html.name
    return attributes
Below is the HTML content that we want to convert to Kili JSON content:
html_doc = """
<table>
<thead>
<tr>
<th colspan="2">The table header</th>
</tr>
</thead>
<tbody>
<tr>
<td>The table body</td>
<td border="1px solid #333">with two columns</td>
</tr>
</tbody>
</table>
"""
Let's convert it using our custom method:
soup = BeautifulSoup(html_doc, "html.parser")
json_content = [from_html_to_kili(soup)]
print(json_content)
[{'children': [{'border': '1px solid #333', 'children': [{'children': [{'children': [{'backgroundColor': '#333', 'color': '#fff', 'children': [{'id': '0.06746091905311846', 'text': 'The table header'}], 'type': 'th'}], 'type': 'tr'}], 'type': 'thead'}, {'children': [{'children': [{'border': '1px solid #333', 'children': [{'id': '0.6734828512730971', 'text': 'The table body'}], 'type': 'td'}, {'border': '1px solid #333', 'children': [{'id': '0.38249704337851276', 'text': 'with two columns'}], 'type': 'td'}], 'type': 'tr'}], 'type': 'tbody'}], 'type': 'table'}]}]
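A quick way to sanity-check a conversion is to gather the text of all text nodes and compare it with the text of the source HTML. The sketch below uses a hand-copied, shortened version of the output above (ids replaced by placeholders). `collect_text` is a hypothetical helper, not a Kili SDK function.

```python
def collect_text(nodes):
    """Return the text of every text node, in document order."""
    texts = []
    for node in nodes:
        if "text" in node:
            texts.append(node["text"])
        texts.extend(collect_text(node.get("children", [])))
    return texts


def cell(cell_type, node_id, text):
    """Build a table cell element holding a single text node."""
    return {"type": cell_type, "children": [{"id": node_id, "text": text}]}


# Same nesting as the converted table above, with styles left out for brevity.
header_row = {"type": "tr", "children": [cell("th", "id-1", "The table header")]}
body_row = {
    "type": "tr",
    "children": [
        cell("td", "id-2", "The table body"),
        cell("td", "id-3", "with two columns"),
    ],
}
table = {
    "type": "table",
    "children": [
        {"type": "thead", "children": [header_row]},
        {"type": "tbody", "children": [body_row]},
    ],
}
converted = [{"children": [table]}]

texts = collect_text(converted)
print(texts)  # ['The table header', 'The table body', 'with two columns']
```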
Finally, we upload it to our Kili project:
kili.append_many_to_dataset(
project_id=project_id,
json_content_array=[json_content],
)
In Kili, the table looks like this:
Congrats! 👏
In this tutorial, we have learned how to import rich-text assets with styled text nodes into a Kili project.
Now, armed with these techniques, you can start building your own Kili projects with styled and structured text data!
Cleanup
We can remove the project that we created:
kili.delete_project(project_id)