How to import PDF assets to a Kili project
In this tutorial, we will learn how to import PDF assets to your project.
Here are the steps that we will follow:
- Setting up a PDF classification project.
- Importing some PDF assets to Kili.
- Adding their resolutions if needed.
Setting up a PDF Kili project to work with
Installing and instantiating Kili
First, let's install and import the required modules.
%pip install kili
import getpass
import os
from pprint import pprint
from kili.client import Kili
Now, let's set up variables needed to create an instance of the Kili object.
We will need your API key and Kili's API endpoint.
If you are unsure how to look up your API key, refer to https://docs.kili-technology.com/docs/creating-an-api-key.
if "KILI_API_KEY" not in os.environ:
    KILI_API_KEY = getpass.getpass("Please enter your API key: ")
else:
    KILI_API_KEY = os.environ["KILI_API_KEY"]
With variables set up, we can now create an instance of the Kili object.
kili = Kili(
    api_key=KILI_API_KEY,  # no need to pass the API key if it is already in your environment variables
    # api_endpoint="https://cloud.kili-technology.com/api/label/v2/graphql",
    # the line above can be uncommented and changed if you are working with an on-premise version of Kili
)
Creating a PDF Kili project
To create a PDF Kili project, you must first set up its ontology.
Here, we will only add a classification task:
interface = {
    "jobs": {
        "CLASSIFICATION_JOB": {
            "mlTask": "CLASSIFICATION",
            "content": {
                "categories": {
                    "RESEARCH": {"name": "Research", "children": [], "id": "category1"},
                    "GUIDELINES": {"name": "Guidelines", "children": [], "id": "category2"},
                    "TREND_ANALYSIS": {"children": [], "name": "Trend analysis", "id": "category3"},
                },
                "input": "radio",
            },
            "instruction": "Classify the document",
            "required": 1,
            "isChild": False,
            "isNew": False,
        },
    }
}
project = kili.create_project(
    title="[Kili SDK Notebook]: Importing assets with metadata - PDF",
    description="Project Description",
    input_type="PDF",
    json_interface=interface,
)
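When the list of categories grows, writing each entry by hand becomes error-prone. The sketch below generates the same job structure from a list of display names. The helper `build_classification_job` and the variable `generated_interface` are hypothetical names introduced for illustration and are not part of the Kili SDK; the key and `id` naming simply mirrors the convention used in the interface above.

```python
def build_classification_job(category_names, instruction):
    """Build a single-choice classification job from category display names.

    Hypothetical helper (not part of the Kili SDK): it only assembles a plain
    dictionary with the same shape as the hand-written job above.
    """
    categories = {}
    for index, name in enumerate(category_names, start=1):
        # Derive the key from the display name, e.g. "Trend analysis" -> "TREND_ANALYSIS".
        key = name.upper().replace(" ", "_")
        categories[key] = {"name": name, "children": [], "id": f"category{index}"}
    return {
        "mlTask": "CLASSIFICATION",
        "content": {"categories": categories, "input": "radio"},
        "instruction": instruction,
        "required": 1,
        "isChild": False,
        "isNew": False,
    }


generated_interface = {
    "jobs": {
        "CLASSIFICATION_JOB": build_classification_job(
            ["Research", "Guidelines", "Trend analysis"],
            "Classify the document",
        )
    }
}
```

The resulting dictionary can be passed as `json_interface` in place of the hand-written one.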
You can now add assets. Here, we will add PDFs from public URLs:
project_id = project["id"]
external_ids = ["2306.17766", "2306.17582", "2306.17514"]
assets = kili.append_many_to_dataset(
    project_id=project_id,
    content_array=[
        "https://arxiv.org/pdf/2306.17766",
        "https://arxiv.org/pdf/2306.17582",
        "https://arxiv.org/pdf/2306.17514",
    ],
    external_id_array=external_ids,
)
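In the call above, `content_array` and `external_id_array` are matched by position, so the two lists must stay aligned. One way to guarantee this is to derive both from a single list of arXiv identifiers, as in the sketch below. The helper `arxiv_import_arrays` and the constant `ARXIV_PDF_BASE` are hypothetical names introduced for illustration; the duplicate check is a defensive choice of this sketch, not a documented SDK behavior.

```python
ARXIV_PDF_BASE = "https://arxiv.org/pdf/"  # URL prefix used by the example above


def arxiv_import_arrays(arxiv_ids):
    """Return (content_array, external_id_array) built from arXiv identifiers.

    Hypothetical helper: both lists come from the same source, so the URL at
    position i always corresponds to the external ID at position i.
    """
    ids = list(arxiv_ids)
    if len(set(ids)) != len(ids):
        # Duplicate IDs would make the imported assets hard to tell apart.
        raise ValueError("duplicate external IDs in input")
    content_array = [ARXIV_PDF_BASE + arxiv_id for arxiv_id in ids]
    return content_array, ids


content_array, external_id_array = arxiv_import_arrays(
    ["2306.17766", "2306.17582", "2306.17514"]
)
```

Both lists can then be passed to `kili.append_many_to_dataset` as in the cell above.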
Update and check the resolutions
You can now update and check the resolutions of the PDF assets.
⚠️ This step is only necessary for assets that were annotated before 2023-07-05. Before this date, page resolutions were not stored when a label was submitted, so you have to backfill these values yourself if you need them.
from kili.utils.assets import PageResolution
kili.update_properties_in_assets(
    project_id=project_id,
    external_ids=external_ids,
    page_resolutions_array=[
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
        [
            PageResolution(page_number=1, height=700, width=500),
            PageResolution(page_number=2, height=700, width=500),
            PageResolution(page_number=3, height=700, width=500),
            PageResolution(page_number=4, height=700, width=500),
        ],
    ],
)
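The nested list above repeats the same four entries for each of the three assets. When every page shares one resolution, the structure can be generated instead of typed out. The sketch below is one possible way to do it: `uniform_page_resolutions` and `as_plain_dict` are hypothetical helpers introduced for illustration. The stand-in factory `as_plain_dict` exists only so the sketch runs without the SDK installed; in the notebook you would pass `PageResolution` as `make_page`, since it takes the same keyword arguments in the call above.

```python
def uniform_page_resolutions(make_page, n_assets, n_pages, height, width):
    """Build one list of per-page resolutions for each asset.

    `make_page` is any callable accepting page_number, height and width as
    keyword arguments. Page numbers start at 1, as in the example above.
    """
    return [
        [
            make_page(page_number=page_number, height=height, width=width)
            for page_number in range(1, n_pages + 1)
        ]
        for _ in range(n_assets)
    ]


def as_plain_dict(page_number, height, width):
    # Stand-in factory used only to keep this sketch self-contained.
    return {"page_number": page_number, "height": height, "width": width}


page_resolutions_array = uniform_page_resolutions(
    as_plain_dict, n_assets=3, n_pages=4, height=700, width=500
)
```

Real documents often mix page sizes, in which case the resolutions have to be read from each PDF rather than assumed to be uniform.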
Fetching the asset resolutions
You can fetch the PDF asset resolutions this way:
assets = kili.assets(
    project_id,
    fields=["id", "pageResolutions.height", "pageResolutions.width", "pageResolutions.pageNumber"],
)
pprint(assets)
[{'id': 'cljohld0v0000lfvz0ktpb290',
'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
{'height': 700, 'pageNumber': 2, 'width': 500},
{'height': 700, 'pageNumber': 3, 'width': 500},
{'height': 700, 'pageNumber': 4, 'width': 500}]},
{'id': 'cljohld0v0001lfvzok0h4mlu',
'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
{'height': 700, 'pageNumber': 2, 'width': 500},
{'height': 700, 'pageNumber': 3, 'width': 500},
{'height': 700, 'pageNumber': 4, 'width': 500}]},
{'id': 'cljohld0v0002lfvz3vid56ha',
'pageResolutions': [{'height': 700, 'pageNumber': 1, 'width': 500},
{'height': 700, 'pageNumber': 2, 'width': 500},
{'height': 700, 'pageNumber': 3, 'width': 500},
{'height': 700, 'pageNumber': 4, 'width': 500}]}]
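To check which assets still need a backfill, the list returned above can be post-processed with plain Python. The sketch below assumes the same dictionary shape as the printed output; the function names and the sample IDs `asset-a` and `asset-b` are hypothetical. It also assumes that an asset without stored resolutions comes back with an empty or missing `pageResolutions` value, which is not confirmed by this tutorial.

```python
def resolutions_by_asset(assets):
    """Map each asset ID to {pageNumber: (width, height)}."""
    summary = {}
    for asset in assets:
        # Treat a missing key or a None value the same as an empty list.
        pages = asset.get("pageResolutions") or []
        summary[asset["id"]] = {
            page["pageNumber"]: (page["width"], page["height"]) for page in pages
        }
    return summary


def assets_missing_resolutions(assets):
    """Return the IDs of assets that have no page resolutions stored."""
    return [asset["id"] for asset in assets if not asset.get("pageResolutions")]


# Hypothetical sample data with the same shape as the output above.
sample_assets = [
    {
        "id": "asset-a",
        "pageResolutions": [
            {"height": 700, "pageNumber": 1, "width": 500},
            {"height": 700, "pageNumber": 2, "width": 500},
        ],
    },
    {"id": "asset-b", "pageResolutions": []},
]

summary = resolutions_by_asset(sample_assets)
missing = assets_missing_resolutions(sample_assets)
```

Here `missing` lists the assets that would need a call to `update_properties_in_assets`.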
Cleanup
We can remove the project that we created if needed:
kili.delete_project(project_id)