How to import PDF assets to a Kili project

In this tutorial, we will learn how to import PDF assets to your project.

Here are the steps that we will follow:

  1. Setting up a PDF classification project.
  2. Importing some PDF assets to Kili.
  3. Adding their page resolutions, if needed.

Setting up a PDF Kili project to work with

Installing and instantiating Kili

First, let's install and import the required modules.

%pip install kili
import getpass
import os
from pprint import pprint

from kili.client import Kili

Now, let's set up variables needed to create an instance of the Kili object.

We will need your API key and Kili's API endpoint.

If you are unsure how to look up your API key, refer to https://docs.kili-technology.com/docs/creating-an-api-key.

if "KILI_API_KEY" not in os.environ:
    KILI_API_KEY = getpass.getpass("Please enter your API key: ")
else:
    KILI_API_KEY = os.environ["KILI_API_KEY"]

With variables set up, we can now create an instance of the Kili object.

kili = Kili(
    api_key=KILI_API_KEY,  # can be omitted if KILI_API_KEY is already set in your environment variables
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)

Creating a PDF Kili project

To create a PDF Kili project, you must first set up its ontology.

Here, we will only add a classification task:

interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "RESEARCH": {"name": "Research", "children": [], "id": "category1"},
                    "GUIDELINES": {"name": "Guidelines", "children": [], "id": "category2"},
                    "TREND_ANALYSIS": {"children": [], "name": "Trend analysis", "id": "category3"},
                },
                "input": "radio",
            },
            "instruction": "Classify the document",
            "required": 1,
            "isChild": False,
            "isNew": False,
        },
    }
}
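Before sending the ontology to Kili, it can help to sanity-check the dictionary locally. The helper below is a minimal sketch, not part of the Kili SDK: `check_classification_interface` is a hypothetical name, and the rules it applies (at least one job, at least one category per classification job, a `name` for every category) are assumptions based on the shape of the interface shown above. Kili's own validation may differ.

```python
def check_classification_interface(interface):
    """Return a list of problems found in a json_interface-style dict (empty if none)."""
    problems = []
    jobs = interface.get("jobs") or {}
    if not jobs:
        problems.append("no jobs defined")
    for job_name, job in jobs.items():
        if job.get("mlTask") != "CLASSIFICATION":
            continue  # this sketch only looks at classification jobs
        categories = (job.get("content") or {}).get("categories") or {}
        if not categories:
            problems.append(f"{job_name}: no categories defined")
        for key, category in categories.items():
            if not category.get("name"):
                problems.append(f"{job_name}.{key}: missing display name")
    return problems


# Example input: same shape as the tutorial's interface, with one name left out on purpose.
example_interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "RESEARCH": {"name": "Research", "children": [], "id": "category1"},
                    "GUIDELINES": {"children": [], "id": "category2"},
                },
                "input": "radio",
            },
        }
    }
}

problems = check_classification_interface(example_interface)
print(problems)
```

An empty list means the dictionary passed these local checks; it says nothing about whether Kili will accept it.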

project = kili.create_project(
    title="[Kili SDK Notebook]: Importing assets with metadata - PDF",
    description="Project Description",
    input_type="PDF",
    json_interface=interface,
)

You can now add assets. Here, we will add PDFs from public URLs:

project_id = project["id"]
external_ids = ["2306.17766", "2306.17582", "2306.17514"]

assets = kili.append_many_to_dataset(
    project_id=project_id,
    content_array=[
        "https://arxiv.org/pdf/2306.17766",
        "https://arxiv.org/pdf/2306.17582",
        "https://arxiv.org/pdf/2306.17514",
    ],
    external_id_array=external_ids,
)
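In the call above, `content_array` and `external_id_array` are matched by position, so the two lists must stay the same length and in the same order. One way to guarantee this is to derive both from a single list of identifiers. The snippet below is a small sketch of that idea: `build_import_arrays` and `ARXIV_PDF_BASE` are hypothetical names introduced for illustration, and the duplicate check assumes that external IDs should be unique within a project.

```python
ARXIV_PDF_BASE = "https://arxiv.org/pdf/"  # URL prefix used by the tutorial's assets


def build_import_arrays(arxiv_ids):
    """Build positionally aligned (content_array, external_id_array) lists."""
    ids = list(arxiv_ids)
    if len(set(ids)) != len(ids):
        # Assumption of this sketch: external IDs are meant to be unique.
        raise ValueError("external IDs must be unique")
    content_array = [ARXIV_PDF_BASE + arxiv_id for arxiv_id in ids]
    return content_array, ids


content_array, external_id_array = build_import_arrays(
    ["2306.17766", "2306.17582", "2306.17514"]
)
for url, external_id in zip(content_array, external_id_array):
    print(external_id, "->", url)
```

The two returned lists can then be passed as `content_array` and `external_id_array` to `kili.append_many_to_dataset`.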

Update and check the resolutions

You can now update and check the page resolutions of the PDF assets.

⚠️ This step is only necessary for assets that were annotated before 2023-07-05. Before this date, page resolutions were not stored when a label was submitted, so you have to backfill them yourself if you need them.

from kili.utils.assets import PageResolution

kili.update_properties_in_assets(
    project_id=project_id,
    external_ids=external_ids,
    page_resolutions_array=[
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
    ],
)
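When every page of every asset has the same size, as in the call above, the nested list can be generated rather than typed out. The following is a sketch under that assumption: `uniform_page_resolutions` is a hypothetical helper, and it builds plain dictionaries by default so that it runs without the Kili SDK installed.

```python
def uniform_page_resolutions(n_assets, n_pages, height, width, factory=dict):
    """Return one list of page entries per asset, all with the same height and width.

    `factory` is called with page_number, height and width keyword arguments.
    """
    return [
        [
            factory(page_number=page_number, height=height, width=width)
            for page_number in range(1, n_pages + 1)  # pages are numbered from 1
        ]
        for _ in range(n_assets)
    ]


resolutions = uniform_page_resolutions(n_assets=3, n_pages=4, height=700, width=500)
print(len(resolutions), "assets,", len(resolutions[0]), "pages each")
print(resolutions[0][0])
```

Passing `factory=PageResolution` should produce the same objects that the call above builds by hand, since `PageResolution` takes the same three keyword arguments. Real documents often have pages of different sizes, in which case the values have to be read from the PDFs themselves.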

Fetching the asset resolutions

You can fetch the PDF asset resolutions this way:

assets = kili.assets(
    project_id,
    fields=["id", "pageResolutions.height", "pageResolutions.width", "pageResolutions.pageNumber"],
)
pprint(assets)

[{'id': 'cljohld0v0000lfvz0ktpb290',
  'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
                      {'height': 700, 'pageNumber': 2, 'width': 500},
                      {'height': 700, 'pageNumber': 3, 'width': 500},
                      {'height': 700, 'pageNumber': 4, 'width': 500}]},
 {'id': 'cljohld0v0001lfvzok0h4mlu',
  'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
                      {'height': 700, 'pageNumber': 2, 'width': 500},
                      {'height': 700, 'pageNumber': 3, 'width': 500},
                      {'height': 700, 'pageNumber': 4, 'width': 500}]},
 {'id': 'cljohld0v0002lfvz3vid56ha',
  'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
                      {'height': 700, 'pageNumber': 2, 'width': 500},
                      {'height': 700, 'pageNumber': 3, 'width': 500},
                      {'height': 700, 'pageNumber': 4, 'width': 500}]}]
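With many assets, reading the printed output is not practical. A short check over the returned list can point out which assets still lack page resolutions. The sketch below assumes the result has the shape printed above (a list of dictionaries with `id` and `pageResolutions` keys); `assets_missing_resolutions` is a hypothetical helper, and the sample data is made up for illustration.

```python
def assets_missing_resolutions(assets, expected_pages=None):
    """Return the ids of assets with no page resolutions, or with an unexpected page count."""
    missing = []
    for asset in assets:
        pages = asset.get("pageResolutions") or []  # treat None like an empty list
        wrong_count = expected_pages is not None and len(pages) != expected_pages
        if not pages or wrong_count:
            missing.append(asset["id"])
    return missing


# Illustrative data in the same shape as the output above (ids are placeholders).
sample_assets = [
    {"id": "asset-a", "pageResolutions": [{"height": 700, "pageNumber": 1, "width": 500}]},
    {"id": "asset-b", "pageResolutions": []},
    {"id": "asset-c", "pageResolutions": None},
]

to_backfill = assets_missing_resolutions(sample_assets)
print(to_backfill)
```

The returned ids are the assets that would need a backfill with `kili.update_properties_in_assets`.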

Cleanup

We can remove the project that we created if needed:

kili.delete_project(project_id)