How to import rich-text assets

When dealing with textual data, style can convey a lot of meaning. If you annotate a long list or a legal text, displaying structured text instead of plain text lets your annotators quickly grasp patterns within the document.

Setup

%pip install kili
from random import random

from kili.client import Kili
kili = Kili(
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)

Discover rich-text Kili format

Let's create a named-entity recognition project that aims at labeling American legal documents.

We start by defining the ontology of our project, with several categories:

  • INSTRUCTIONS
  • PREAMBLE
  • RIGHTS
  • REFERENCE_TO_GOD
json_interface = {
    "jobs": {
        "JOB_0": {
            "mlTask": "NAMED_ENTITIES_RECOGNITION",
            "instruction": "Categories",
            "required": 1,
            "isChild": False,
            "isVisible": True,
            "content": {
                "categories": {
                    "INSTRUCTIONS": {"name": "Instructions", "children": [], "color": "#cc4125"},
                    "PREAMBLE": {"name": "Preamble", "children": [], "color": "#ffd966"},
                    "RIGHTS": {"name": "Rights", "children": [], "color": "#76a5af"},
                    "REFERENCE_TO_GOD": {
                        "name": "Reference to God",
                        "children": [],
                        "color": "#c27ba0",
                    },
                },
                "input": "radio",
            },
        }
    }
}
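If the ontology grows, writing every category by hand gets repetitive. Below is a minimal sketch that builds the "categories" mapping from a list of triples. The build_categories helper is hypothetical (it is not part of the Kili SDK); it only produces the same dict shape as the interface above.

```python
def build_categories(specs):
    """Build the "categories" mapping of a job from (key, display name, color) triples."""
    return {
        key: {"name": name, "children": [], "color": color}
        for key, name, color in specs
    }


# Same four categories as in the interface above.
categories = build_categories(
    [
        ("INSTRUCTIONS", "Instructions", "#cc4125"),
        ("PREAMBLE", "Preamble", "#ffd966"),
        ("RIGHTS", "Rights", "#76a5af"),
        ("REFERENCE_TO_GOD", "Reference to God", "#c27ba0"),
    ]
)
print(list(categories))
```

The resulting dict can be placed under "content" → "categories" of the job.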

Next, we create a project:

project_id = kili.create_project(
    json_interface=json_interface,
    input_type="TEXT",
    title="[Kili SDK Notebook]: Import text assets",
)["id"]

To add assets to your Kili project, use the kili.append_many_to_dataset() method.

In case of rich text, you need to use the json_content_array argument:

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content_1, json_content_2],  # one json_content per asset
)

Note that the content_array parameter of this method is used for uploading standard text assets.

A json_content object is a Python list of nodes, where each node is a dict.

Nodes can be either element nodes or text nodes.

  • An element node can have children (that is a list of other elements or text nodes).
  • By default, element nodes are <div />.
  • Possible types for an element node are:
    • blockquote
    • h1
    • h2
    • h3
    • h4
    • li
    • ol
    • p
    • table
    • tbody
    • td
    • thead
    • tr
    • ul
  • Available styles for a node (see Mozilla reference to learn more on CSS):

    • alignItems
    • alignSelf
    • background
    • backgroundColor
    • border
    • borderBottom
    • borderLeft
    • borderRadius
    • borderRight
    • borderTop
    • color
    • display
    • flexDirection
    • float
    • fontWeight
    • height
    • margin
    • marginBottom
    • marginLeft
    • marginRight
    • marginTop
    • maxHeight
    • maxWidth
    • minHeight
    • minWidth
    • padding
    • paddingBottom
    • paddingLeft
    • paddingRight
    • paddingTop
    • textAlign
    • textDecoration
    • textIndent
    • width
  • A text node can contain text.

  • By default, text nodes are <span />.
  • Text nodes are identified by an id. The id must be unique across the whole document. This allows for overlapping entities between two or more text nodes.
  • Available styles for a text node:
    • bold: true
    • code: true
    • italic: true
    • underline: true
    • display
    • float
    • fontWeight
    • margin
    • marginBottom
    • marginLeft
    • marginRight
    • marginTop
    • padding
    • paddingBottom
    • paddingLeft
    • paddingRight
    • paddingTop
    • textAlign
    • textDecoration
    • textIndent
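Since every text node needs an id that is unique across the document, it can help to check a json_content before uploading it. The sketch below is only an illustration: iter_text_nodes and find_id_problems are hypothetical helpers, and we assume that a text node is any dict with a "text" key.

```python
from collections import Counter


def iter_text_nodes(nodes):
    """Yield every text node (a dict with a "text" key) found in a list of nodes."""
    for node in nodes:
        if "text" in node:
            yield node
        else:
            # Element node: look into its children, if any.
            yield from iter_text_nodes(node.get("children", []))


def find_id_problems(json_content):
    """Return (number of text nodes without an id, sorted list of duplicated ids)."""
    ids = [node.get("id") for node in iter_text_nodes(json_content)]
    missing = sum(1 for node_id in ids if node_id is None)
    counts = Counter(node_id for node_id in ids if node_id is not None)
    duplicated = sorted(node_id for node_id, n in counts.items() if n > 1)
    return missing, duplicated


example = [
    {
        "type": "p",
        "children": [
            {"id": "intro-1", "bold": True, "text": "The unanimous Declaration"},
            {"id": "intro-1", "text": " of the thirteen United States of America."},
        ],
    }
]
print(find_id_problems(example))  # (0, ['intro-1'])
```

An upload-ready document should give (0, []).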

Let's see some real examples with the corresponding result in Kili's interface!

We can start with a really simple example:

json_content = [
    {
        "children": [
            {
                "id": f"{random()}",  # set your own id here
                "text": "The unanimous Declaration of the thirteen United States of America.",
            }
        ]
    }
]

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
)

You should be able to see the new asset in the Kili project:

image.png

Let's try to upload the asset again, but this time with some style:

json_content = [
    {
        "children": [
            {
                "id": f"{random()}",
                "bold": True,
                "underline": True,
                "text": "The unanimous Declaration of the thirteen United States of America.",
            }
        ]
    }
]

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
)

image.png
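The examples in this tutorial use random() to generate ids. As an alternative, here is a small sketch that relies on uuid4 from the Python standard library. The text_node helper is hypothetical; any scheme works as long as the ids stay unique within the document.

```python
from uuid import uuid4


def text_node(text, **styles):
    """Return a text node with a freshly generated unique id and optional styles."""
    return {"id": str(uuid4()), "text": text, **styles}


json_content = [
    {
        "children": [
            text_node("The unanimous Declaration", bold=True, underline=True),
            text_node(" of the thirteen United States of America.", bold=True),
        ]
    }
]
```

The resulting json_content can be passed to json_content_array exactly like the hand-written ones.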

Another test with an even more complex style. We can try to mix styled text nodes with plain-text nodes:

json_content = [
    {
        "children": [
            {
                "type": "p",
                "children": [
                    {
                        "id": f"{random()}",
                        "bold": True,
                        "underline": True,
                        "text": "The unanimous Declaration",
                    },
                    {
                        "id": f"{random()}",
                        "bold": True,
                        "text": " of the thirteen United States of America.",
                    },
                    {
                        "id": f"{random()}",
                        "text": (
                            "When in the Course of human events, it becomes necessary for one"
                            " people to dissolve the political bands which have connected them with"
                            " another, and to assume among the powers of the earth, the separate"
                            " and equal station to which the Laws of Nature and of Nature's God"
                            " entitle them, a decent respect to the opinions of mankind requires"
                            " that they should declare the causes which impel them to the"
                            " separation."
                        ),
                    },
                ],
            }
        ]
    }
]

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
)

image.png

Let's create a new asset with a title, a subtitle and proper margins:

json_content = [
    {
        "children": [
            {
                "type": "h1",
                "children": [
                    {
                        "border": "1px solid black",
                        "textAlign": "center",
                        "children": [{"id": f"{random()}", "text": "Declaration of Independence"}],
                    },
                ],
            },
            {
                "type": "h2",
                "children": [{"id": f"{random()}", "text": "In Congress, July 4, 1776"}],
            },
            {
                "type": "p",
                "children": [
                    {
                        "id": f"{random()}",
                        "bold": True,
                        "underline": True,
                        "text": "The unanimous Declaration",
                    },
                    {
                        "id": f"{random()}",
                        "bold": True,
                        "text": " of the thirteen United States of America.",
                    },
                    {
                        "id": f"{random()}",
                        "text": (
                            "When in the Course of human events, it becomes necessary for one"
                            " people to dissolve the political bands which have connected them with"
                            " another, and to assume among the powers of the earth, the separate"
                            " and equal station to which the Laws of Nature and of Nature's God"
                            " entitle them, a decent respect to the opinions of mankind requires"
                            " that they should declare the causes which impel them to the"
                            " separation."
                        ),
                    },
                ],
            },
            {
                "type": "p",
                "marginLeft": "30px",
                "marginRight": "30px",
                "border": "red",
                "children": [
                    {
                        "id": f"{random()}",
                        "text": (
                            "We hold these truths to be self-evident, that all men are created"
                            " equal, that they are endowed by their Creator with certain"
                            " unalienable Rights, that among these are Life, Liberty and the"
                            " pursuit of Happiness."
                        ),
                    },
                    {
                        "type": "ul",
                        "children": [
                            {
                                "type": "li",
                                "children": [
                                    {
                                        "id": f"{random()}",
                                        "text": (
                                            "That to secure these rights, Governments are"
                                            " instituted among Men, deriving their just powers from"
                                            " the consent of the governed,"
                                        ),
                                    },
                                ],
                            },
                            {
                                "type": "li",
                                "children": [
                                    {
                                        "id": f"{random()}",
                                        "text": (
                                            "That whenever any Form of Government becomes"
                                            " destructive of these ends, it is the Right of the"
                                            " People to alter or to abolish it, and to institute"
                                            " new Government, laying its foundation on such"
                                            " principles and organizing its powers in such form, as"
                                            " to them shall seem most likely to effect their Safety"
                                            " and Happiness."
                                        ),
                                    },
                                ],
                            },
                        ],
                    },
                    {
                        "id": f"{random()}",
                        "text": (
                            "Prudence, indeed, will dictate that Governments long established"
                            " should not be changed for light and transient causes; and accordingly"
                            " all experience hath shewn, that mankind are more disposed to suffer,"
                            " while evils are sufferable, than to right themselves by abolishing"
                            " the forms to which they are accustomed. But when a long train of"
                            " abuses and usurpations, pursuing invariably the same Object evinces a"
                            " design to reduce them under absolute Despotism, it is their right, it"
                            " is their duty, to throw off such Government, and to provide new"
                            " Guards for their future security.--Such has been the patient"
                            " sufferance of these Colonies; and such is now the necessity which"
                            " constrains them to alter their former Systems of Government. The"
                            " history of the present King of Great Britain is a history of repeated"
                            " injuries and usurpations, all having in direct object the"
                            " establishment of an absolute Tyranny over these States. To prove"
                            " this, let Facts be submitted to a candid world."
                        ),
                    },
                ],
            },
        ],
    },
]

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
)

image.png
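Deeply nested dicts like the one above are easy to get wrong by hand. Below is a sketch of a few builder functions that produce the same kind of structure. The names text, element and bullet_list are hypothetical, and the ids come from a simple counter, which we assume is enough to keep them unique within one document.

```python
from itertools import count

_next_id = count(1)  # simple counter: ids are unique within this Python session


def text(content, **styles):
    """Text node with a unique id and optional styles (bold, italic, ...)."""
    return {"id": f"node-{next(_next_id)}", "text": content, **styles}


def element(node_type, children, **styles):
    """Element node of the given type (p, h1, ul, ...) with optional styles."""
    return {"type": node_type, "children": list(children), **styles}


def bullet_list(items):
    """Unordered list with one li element per string in items."""
    return element("ul", [element("li", [text(item)]) for item in items])


json_content = [
    {
        "children": [
            element("h2", [text("In Congress, July 4, 1776")]),
            element(
                "p",
                [
                    text("Among these unalienable Rights are:"),
                    bullet_list(["Life", "Liberty", "the pursuit of Happiness"]),
                ],
                marginLeft="30px",
                marginRight="30px",
            ),
        ]
    }
]
```

Styles are passed as keyword arguments, so a node reads close to how it renders.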

Convert HTML to rich-text Kili format

To make the whole process as seamless as possible, let's use simple Python code to transform raw HTML snippets directly into the Kili format.

For this purpose, we will use BeautifulSoup 4 and convert a simple table taken from the Mozilla documentation:

%pip install beautifulsoup4
from bs4 import BeautifulSoup
def from_html_to_kili(html: BeautifulSoup):
    # Default styles applied to some HTML tags
    styles = {
        "table": {"border": "1px solid #333"},
        "td": {"border": "1px solid #333"},
        "th": {"backgroundColor": "#333", "color": "#fff"},
    }

    if html is None:
        return html

    # Start from a copy of the default styles for this tag, if any
    attributes = dict(styles.get(html.name, {}))
    # Recursively convert the direct child tags
    children = [from_html_to_kili(child) for child in html.find_all(recursive=False)]
    if not children:
        # A tag without child tags becomes a single text node
        children.append({"id": f"{random()}", "text": html.text.strip()})
    attributes["children"] = children
    # The BeautifulSoup root object has no HTML tag of its own
    if html.name != "[document]":
        attributes["type"] = html.name
    return attributes

Below is the HTML content that we want to convert to Kili JSON content:

html_doc = """
<table>
    <thead>
        <tr>
            <th colspan="2">The table header</th>
        </tr>
    </thead>
    <tbody>
        <tr>
            <td>The table body</td>
            <td border="1px solid #333">with two columns</td>
        </tr>
    </tbody>
</table>
"""

Let's convert it using our custom function:

soup = BeautifulSoup(html_doc, "html.parser")
json_content = [from_html_to_kili(soup)]
print(json_content)
[{'children': [{'border': '1px solid #333', 'children': [{'children': [{'children': [{'backgroundColor': '#333', 'color': '#fff', 'children': [{'id': '0.06746091905311846', 'text': 'The table header'}], 'type': 'th'}], 'type': 'tr'}], 'type': 'thead'}, {'children': [{'children': [{'border': '1px solid #333', 'children': [{'id': '0.6734828512730971', 'text': 'The table body'}], 'type': 'td'}, {'border': '1px solid #333', 'children': [{'id': '0.38249704337851276', 'text': 'with two columns'}], 'type': 'td'}], 'type': 'tr'}], 'type': 'tbody'}], 'type': 'table'}]}]
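The printed structure is hard to read on a single line. As a convenience, here is a sketch of a helper that lists the nesting of a json_content, one node per line. The outline function is hypothetical, and the sample is a shortened, hand-written version of the converted table.

```python
def outline(nodes, depth=0):
    """Return one indented line per node: <type> for elements, quoted text for text nodes."""
    lines = []
    indent = "  " * depth
    for node in nodes:
        if "text" in node:
            lines.append(f'{indent}"{node["text"]}"')
        else:
            # Element nodes without an explicit type are rendered as div.
            lines.append(f'{indent}<{node.get("type", "div")}>')
            lines.extend(outline(node.get("children", []), depth + 1))
    return lines


sample = [
    {
        "children": [
            {
                "type": "table",
                "children": [
                    {
                        "type": "tr",
                        "children": [
                            {
                                "type": "td",
                                "children": [{"id": "a", "text": "The table body"}],
                            }
                        ],
                    }
                ],
            }
        ]
    }
]
print("\n".join(outline(sample)))
```

Calling outline(json_content) on the converted table gives a quick way to check that the nesting matches the original HTML.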

Finally, we upload it to our Kili project:

kili.append_many_to_dataset(
    project_id=project_id,
    json_content_array=[json_content],
)

In Kili, the table looks like this:

image.png

Congrats! 👏

In this tutorial, we have learned how to import rich-text assets with styled text nodes into a Kili project.

Now, armed with these techniques, you can start building your own Kili projects with styled and structured text data!

Cleanup

We can remove the project that we created:

kili.delete_project(project_id)