How to convert a Tagtog project to Kili
Setup
%pip install kili rich beautifulsoup4
import json
import os
import pprint
from pathlib import Path
from bs4 import BeautifulSoup
from rich import print_json
from kili.client import Kili
kili = Kili()
Get data
For this recipe, we will use data from the Apartment reviews project made by Jennifer D. Ames.
The dataset consists of guests' reviews of apartments and houses used for short-term lodging and tourism-related activities. The reviews were obtained from Inside Airbnb, and each one was manually annotated in tagtog.
You must be logged in to tagtog to download the project data.
Once the .zip file is downloaded, we can unzip it:
!tar -xvf tagtog_ApartmentReviews.zip
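The tar command above only reads zip archives when it is the libarchive-based bsdtar (the default on macOS and recent Windows). GNU tar, common on Linux, does not. As a portable alternative, here is a minimal sketch using Python's standard zipfile module. The helper name extract_zip is ours and is not part of the original recipe.

```python
import zipfile
from pathlib import Path


def extract_zip(archive_path, destination="."):
    """Extract every member of a zip archive and return the member names."""
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(destination)
        return archive.namelist()


if __name__ == "__main__":
    # File name taken from the recipe; nothing happens if it is not present.
    archive = Path("tagtog_ApartmentReviews.zip")
    if archive.exists():
        extract_zip(archive)
```

Extracting into the current directory yields the same ApartmentReviews folder that the rest of the recipe reads from.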
print(Path("ApartmentReviews/README.md").read_text(encoding="utf-8"))
This dataset lives in: https://tagtog.com/JaqenNLP/ApartmentReviews
This zip was generated with:
* date: _2023-05-10T11:25:54.902Z_
* search: `*`
* total found documents: **228**
The dataset is here written in the [anndoc format](https://docs.tagtog.com/anndoc.html). Use the `annotations-legend.json` file to help you interpret the annotations.
What great things will you do with the dataset? :-) Enjoy!
Let's take a look at the ontology:
tagtog_class_id_to_class_name = json.loads(
    Path("ApartmentReviews/annotations-legend.json").read_text(encoding="utf-8")
)
pprint.pprint(tagtog_class_id_to_class_name)
{'e_1': 'Loyalty',
'e_13': 'TechOpinion',
'e_14': 'FoodOpinion',
'e_2': 'RoomOpinion',
'e_4': 'ApartmentOpinion',
'e_5': 'Location',
'e_6': 'KitchenOpinion',
'e_7': 'BathroomOpinion',
'e_8': 'HostOpinion',
'e_9': 'PricePayment',
'f_10': 'RoomItem',
'f_11': 'OutdoorFeature',
'f_12': 'BathroomFeature',
'f_15': 'TechFeature',
'f_16': 'FoodFeature',
'f_17': 'LocationFeature',
'f_19': 'LoyaltyFeature',
'f_20': 'ApartmentFeature',
'f_21': 'HostCharacter',
'f_22': 'isLocationQuiet',
'f_23': 'Sentiment',
'f_24': 'isCriticalProblem',
'f_25': 'KitchenFeature',
'f_26': 'PaymentFeature',
'm_18': 'Sentiment'}
The ontology shows:
- 10 entity types (keys starting with e_)
- 1 document label (key starting with m_)
- 14 entity labels (keys starting with f_)
An entity label is used to tag an entity type, and aims at giving more information about the entity being annotated. For example, the "f_17": "LocationFeature" label can be used to tag a "e_5": "Location" entity type.
Read more about this project ontology here.
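The prefix convention described above can be used to split the legend programmatically. The following is a minimal sketch on a three-entry excerpt of the legend. The names PREFIX_KINDS and group_legend_by_kind are hypothetical and not part of tagtog or Kili.

```python
from collections import defaultdict

# Human-readable names for the tagtog key prefixes described above.
PREFIX_KINDS = {"e": "entity type", "f": "entity label", "m": "document label"}


def group_legend_by_kind(legend):
    """Group class names from annotations-legend.json by the prefix of their key."""
    groups = defaultdict(list)
    for class_id, class_name in legend.items():
        prefix = class_id.split("_", 1)[0]
        groups[PREFIX_KINDS[prefix]].append(class_name)
    return dict(groups)


# A three-entry excerpt of the legend printed above.
sample_legend = {"e_5": "Location", "f_17": "LocationFeature", "m_18": "Sentiment"}
groups = group_legend_by_kind(sample_legend)
print(groups)
```

Passing the full tagtog_class_id_to_class_name dictionary instead of the excerpt should give the three groups counted above.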
The plain.html/pool folder contains the reviews in HTML format:
print(
    Path("ApartmentReviews/plain.html/pool/a.km05GoV2Uh1mw9QR.UNiNXWUL8-text.plain.html").read_text(
        encoding="utf-8"
    )
)
<!DOCTYPE html >
<html id="a.km05GoV2Uh1mw9QR.UNiNXWUL8-text" data-origid="text" class="anndoc" data-anndoc-version="3.6" lang="" xml:lang="" xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta charset="UTF-8"/>
<meta name="generator" content="net.tagtog.anndoc.v3.parsers.general.PlainTextParser_NewParagraphAfter1Newline_v2_0_0"/>
<title>a.km05GoV2Uh1mw9QR.UNiNXWUL8-text</title>
</head>
<body>
<article>
<section data-type="">
<div class="content">
<p id="s1p1">The house is beautiful, but it is next to 2 very busy roads. There are no aircons in any rooms and to open the windows you hear the cars and trucks pass - very loud. There is no braai facilities and when we asked x 2 we were told the owner had to approve... still waiting! The rooms on the lower floor is empty and depressing, very hot! Kids slept on couches with sliding doors open, so we could not arm the alarm at night. The aircons on the upper level leaks water onto table and electrical wires. I think it is over priced, for that price the basics should be in place. AIRCONS AND BRAAI</p>
</div>
</section>
</article>
</body>
</html>
The annotations are stored in the ann.json/master/pool folder:
annotation_sample = json.loads(
    Path(
        "ApartmentReviews/ann.json/master/pool/a.km05GoV2Uh1mw9QR.UNiNXWUL8-text.ann.json"
    ).read_text(encoding="utf-8")
)
# select only two entities for visualization
annotation_sample["entities"] = annotation_sample["entities"][:2]
pprint.pprint(annotation_sample)
{'anncomplete': True,
'annotatable': {'parts': ['s1p1']},
'entities': [{'classId': 'e_4',
'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'coordinates': [],
'fields': {'f_23': {'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'value': 'positive'}},
'normalizations': {},
'offsets': [{'start': 0, 'text': 'The house is beautiful'}],
'part': 's1p1'},
{'classId': 'e_5',
'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'coordinates': [],
'fields': {'f_22': {'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'value': False},
'f_23': {'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'value': 'negative'},
'f_24': {'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'value': True}},
'normalizations': {},
'offsets': [{'start': 24,
'text': 'but it is next to 2 very busy roads'}],
'part': 's1p1'}],
'metas': {'m_18': {'confidence': {'prob': 1,
'state': 'pre-added',
'who': ['user:Mrdenho']},
'value': 'negative'}},
'relations': [],
'sources': []}
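In this sample, each entry in offsets appears to give a character position inside the part named by part (here the s1p1 paragraph), together with the annotated text. Below is a minimal sketch that checks this against the opening sentence of the review shown earlier. The helper span_matches is a hypothetical name.

```python
def span_matches(text, start, expected):
    """Return True when `expected` is found in `text` at character index `start`."""
    return text[start : start + len(expected)] == expected


# Opening sentence of the review shown earlier, with the two offsets
# taken from the sample annotation.
review = "The house is beautiful, but it is next to 2 very busy roads."
offsets = [
    {"start": 0, "text": "The house is beautiful"},
    {"start": 24, "text": "but it is next to 2 very busy roads"},
]

for offset in offsets:
    print(offset["start"], span_matches(review, offset["start"], offset["text"]))
```

Kili's beginOffset uses the same kind of character index, which is why the conversion later in this recipe can reuse start directly for single-paragraph reviews.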
Create the Kili project
We can now create the Named Entity Recognition (NER) Kili project.
To do so, we need to define a json interface that follows the ontology of the annotations-legend.json file.
Note that with Kili, the json interface can also be created directly from the Kili UI.
Here, we will use a json_interface variable that will be passed to kili.create_project().
We have two main jobs:
- SENTIMENT_CLASSIFICATION_JOB for classifying the review as positive, negative or neutral
- NAMED_ENTITIES_RECOGNITION_JOB for the entity type.
We also have several child jobs representing the entity labels.
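Every child job in the json interface has the same shape: a CLASSIFICATION task with isChild set to True, a radio or checkbox input, and one entry per category. As an illustration only, a hypothetical helper make_child_job (not part of the Kili SDK) could generate that structure from a list of category names:

```python
def make_child_job(instruction, category_names, input_type="checkbox"):
    """Build a child classification job in the shape used by the json interface."""
    return {
        "content": {
            "categories": {name: {"children": [], "name": name} for name in category_names},
            "input": input_type,
        },
        "instruction": instruction,
        "mlTask": "CLASSIFICATION",
        "required": 0,
        "isChild": True,
    }


food_feature_job = make_child_job("FoodFeature", ["breakfast", "drinks", "snacks"])
print(food_feature_job["content"]["categories"]["drinks"])
```

Categories whose key differs from the displayed name, such as payment_methods in the PaymentFeature job, would need an extra argument. The recipe itself writes the interface out in full.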
# fmt: off
json_interface = {
"jobs": {
"SENTIMENT_CLASSIFICATION_JOB": {
"content": {
"categories": {
"positive": {"children": [], "name": "positive"},
"neutral": {"children": [], "name": "neutral"},
"negative": {"children": [], "name": "negative"},
},
"input": "radio",
},
"instruction": "Sentiment",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": False,
},
"NAMED_ENTITIES_RECOGNITION_JOB": {
"content": {
"categories": {
"HostOpinion": {
"children": ["isCriticalProblem", "HostCharacter", "Sentiment"],
"color": "#472CED",
"name": "HostOpinion",
},
"RoomOpinion": {
"children": ["isCriticalProblem", "RoomItem", "Sentiment"],
"name": "RoomOpinion",
"color": "#5CE7B7",
},
"BathroomOpinion": {
"children": ["isCriticalProblem", "Sentiment", "BathroomFeature"],
"name": "BathroomOpinion",
"color": "#D33BCE",
},
"FoodOpinion": {
"children": ["isCriticalProblem", "FoodFeature", "Sentiment"],
"name": "FoodOpinion",
"color": "#FB753C",
},
"Location": {
"children": ["LocationFeature", "isCriticalProblem", "Sentiment", "isLocationQuiet"],
"name": "Location",
"color": "#3BCADB"
},
"KitchenOpinion": {
"children": ["isCriticalProblem", "Sentiment", "KitchenFeature"],
"name": "KitchenOpinion",
"color": "#199CFC",
},
"PricePayment": {
"children": ["isCriticalProblem", "Sentiment", "PaymentFeature"],
"name": "PricePayment",
"color": "#FA484A",
},
"ApartmentOpinion": {
"children": ["isCriticalProblem", "ApartmentFeature", "Sentiment", "KitchenFeature"],
"name": "ApartmentOpinion",
"color": "#ECB82A"
},
"Loyalty": {
"children": ["isCriticalProblem", "LoyaltyFeature", "Sentiment"],
"name": "Loyalty",
"color": "#3CD876",
},
"TechOpinion": {
"children": ["isCriticalProblem", "Sentiment", "TechFeature"],
"name": "TechOpinion",
"color": "#733AFB",
},
},
"input": "radio",
},
"instruction": "Entity type",
"mlTask": "NAMED_ENTITIES_RECOGNITION",
"required": 1,
"isChild": False,
},
"LocationFeature": {
"content": {
"categories": {
"center": {"children": [], "name": "center"},
"airport": {"children": [], "name": "airport"},
"underground": {"children": [], "name": "underground"},
"tram": {"children": [], "name": "tram"},
"bus": {"children": [], "name": "bus"},
"train": {"children": [], "name": "train"},
"highway": {"children": [], "name": "highway"},
"parking": {"children": [], "name": "parking"},
"shopping": {"children": [], "name": "shopping"},
"tourism": {"children": [], "name": "tourism"},
"restaurants": {"children": [], "name": "restaurants"},
"pubs": {"children": [], "name": "pubs"},
"nature": {"children": [], "name": "nature"},
"arrival": {"children": [], "name": "arrival"},
"safety": {"children": [], "name": "safety"},
"ownvehicle": {"children": [], "name": "ownvehicle"},
"taxi": {"children": [], "name": "taxi"},
},
"input": "checkbox",
},
"instruction": "LocationFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"isCriticalProblem": {
"content": {
"categories": {
"yes": {"children": [], "name": "yes"},
"no": {"children": [], "name": "no"},
},
"input": "radio",
},
"instruction": "isCriticalProblem",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"ApartmentFeature": {
"content": {
"categories": {
"cleanliness": {"children": [], "name": "cleanliness"},
"stairs": {"children": [], "name": "stairs"},
"compliantDescription": {"children": [], "name": "compliantDescription"},
"look": {"children": [], "name": "look"},
"space": {"children": [], "name": "space"},
"climate": {"children": [], "name": "climate"},
"pets": {"children": [], "name": "pets"},
"gym": {"children": [], "name": "gym"},
"safety": {"children": [], "name": "safety"},
},
"input": "checkbox",
},
"instruction": "ApartmentFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"FoodFeature": {
"content": {
"categories": {
"breakfast": {"children": [], "name": "breakfast"},
"drinks": {"children": [], "name": "drinks"},
"snacks": {"children": [], "name": "snacks"},
},
"input": "checkbox",
},
"instruction": "FoodFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"LoyaltyFeature": {
"content": {
"categories": {
"solo": {"children": [], "name": "solo"},
"couples": {"children": [], "name": "couples"},
"family": {"children": [], "name": "family"},
"friends": {"children": [], "name": "friends"},
"repeat": {"children": [], "name": "repeat"},
},
"input": "checkbox",
},
"instruction": "LoyaltyFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"HostCharacter": {
"content": {
"categories": {
"friendliness": {"children": [], "name": "friendliness"},
"politeness": {"children": [], "name": "politeness"},
"communication": {"children": [], "name": "communication"},
},
"input": "checkbox",
},
"instruction": "HostCharacter",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"RoomItem": {
"content": {
"categories": {
"bed": {"children": [], "name": "bed"},
"wardrobe": {"children": [], "name": "wardrobe"},
"chair": {"children": [], "name": "chair"},
"desk": {"children": [], "name": "desk"},
},
"input": "checkbox",
},
"instruction": "RoomItem",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"Sentiment": {
"content": {
"categories": {
"positive": {"children": [], "name": "positive"},
"negative": {"children": [], "name": "negative"},
},
"input": "radio",
},
"instruction": "Sentiment",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"TechFeature": {
"content": {
"categories": {
"wifi": {"children": [], "name": "wifi"},
"keypad": {"children": [], "name": "keypad"},
"smarthome": {"children": [], "name": "smarthome"},
"mobile": {"children": [], "name": "mobile"},
"tv": {"children": [], "name": "tv"},
"videogames": {"children": [], "name": "videogames"},
},
"input": "checkbox",
},
"instruction": "TechFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"BathroomFeature": {
"content": {
"categories": {
"shower": {"children": [], "name": "shower"},
"bathtub": {"children": [], "name": "bathtub"},
"wc": {"children": [], "name": "wc"},
"amenities": {"children": [], "name": "amenities"},
"hairdryer": {"children": [], "name": "hairdryer"},
},
"input": "checkbox",
},
"instruction": "BathroomFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"PaymentFeature": {
"content": {
"categories": {
"refund": {"children": [], "name": "refund"},
"deposit": {"children": [], "name": "deposit"},
"payment_methods": {"children": [], "name": "payment methods"},
"extra_charges": {"children": [], "name": "extra charges"},
},
"input": "checkbox",
},
"instruction": "PaymentFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"isLocationQuiet": {
"content": {
"categories": {
"Quiet": {"children": [], "name": "Quiet"},
"Noisy": {"children": [], "name": "Noisy"},
},
"input": "radio",
},
"instruction": "isLocationQuiet",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"OutdoorFeature": {
"content": {
"categories": {
"garden": {"children": [], "name": "garden"},
"patio": {"children": [], "name": "patio"},
"balcon": {"children": [], "name": "balcon"},
"terrace": {"children": [], "name": "terrace"},
"view": {"children": [], "name": "view"},
"pool": {"children": [], "name": "pool"},
"barbecue": {"children": [], "name": "barbecue"},
},
"input": "checkbox",
},
"instruction": "OutdoorFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
"KitchenFeature": {
"content": {
"categories": {"appliances": {"children": [], "name": "appliances"}},
"input": "checkbox",
},
"instruction": "KitchenFeature",
"mlTask": "CLASSIFICATION",
"required": 0,
"isChild": True,
},
}
}
# fmt: on
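Before creating the project, it can be worth checking that every child name listed under an entity category refers to a job that exists and is marked isChild. The sketch below is one way to do that. The helper find_interface_issues is a hypothetical name, and the small interface it runs on is a trimmed stand-in for the full json_interface.

```python
def find_interface_issues(interface):
    """Return (missing, unused) child job names.

    missing: children referenced by a category but not defined as a child job.
    unused: child jobs that no category references.
    """
    jobs = interface["jobs"]
    child_jobs = {name for name, job in jobs.items() if job.get("isChild")}
    referenced = set()
    for job in jobs.values():
        if job.get("isChild"):
            continue
        for category in job["content"]["categories"].values():
            referenced.update(category.get("children", []))
    return sorted(referenced - child_jobs), sorted(child_jobs - referenced)


# Trimmed stand-in for json_interface, with one missing and one unused child job.
trimmed_interface = {
    "jobs": {
        "NAMED_ENTITIES_RECOGNITION_JOB": {
            "content": {
                "categories": {
                    "FoodOpinion": {"children": ["FoodFeature", "Sentiment"], "name": "FoodOpinion"}
                },
                "input": "radio",
            },
            "isChild": False,
        },
        "FoodFeature": {"content": {"categories": {}, "input": "checkbox"}, "isChild": True},
        "OutdoorFeature": {"content": {"categories": {}, "input": "checkbox"}, "isChild": True},
    }
}
missing, unused = find_interface_issues(trimmed_interface)
print("missing:", missing)
print("unused:", unused)
```

Run on the full json_interface above, this check would report OutdoorFeature as unused, since no entity category lists it among its children.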
Let's create the text project with our ontology:
project_id = kili.create_project(
    input_type="TEXT", json_interface=json_interface, title="Tagtog to Kili recipe"
)["id"]
In the Kili project created, we can see the ontology that follows the same structure as the one in tagtog:
Upload assets
Now that the project is created, we can upload our assets.
filepaths = [
    Path("ApartmentReviews/plain.html/pool/") / filepath
    for filepath in os.listdir("ApartmentReviews/plain.html/pool/")
]
print(len(filepaths))
228
external_ids = [filepath.name.replace(".plain.html", "") for filepath in filepaths]
assets = [
    BeautifulSoup(file.read_text(encoding="utf-8"), "html.parser").find("body").get_text().strip()
    for file in filepaths
]
print(external_ids[0])
print(assets[0])
aKRwFKFHhO5FoLBBCEUJl8gkQoBS-text
Jason was great! He welcomed us with cold waters and fully explained the apartment and its many amenities. He provided anything we could have forgotten and made sure to answer the many questions we had. He was quick to answer any additional questions throughout the stay as well. The apartment is located on a quiet hutong right off a main street that is right in the center of the old town and is just a short walk to the metro. The shower is a bit hard to get in and out of, and would be difficult if you have mobility issues. But the apartment was great and we really enjoyed a little bit of local living while in Beijing. Thanks Jason!
kili.append_many_to_dataset(project_id, content_array=assets, external_id_array=external_ids)
Our assets have been successfully uploaded to Kili:
Upload existing annotations
Now that our project is created with its assets, we can upload the annotations we made in tagtog.
But first we need to convert the tagtog annotations to the Kili format.
json_response_array = []
for asset_name, asset in zip(external_ids, assets):
    # load the tagtog json annotation
    annotation_path = Path("ApartmentReviews/ann.json/master/pool/") / (asset_name + ".ann.json")
    tagtog_annotation = json.loads(annotation_path.read_text(encoding="utf-8"))
    # skip if not annotated
    if not tagtog_annotation["anncomplete"]:
        continue
    # kili json response
    json_response = {}
    # overall review sentiment classification
    try:
        sentiment_class = tagtog_annotation["metas"]["m_18"]["value"]
    except KeyError:
        pass  # the asset is not annotated for sentiment
    else:
        assert sentiment_class in ("positive", "negative", "neutral"), sentiment_class
        sentiment_confidence = int(
            round(tagtog_annotation["metas"]["m_18"]["confidence"]["prob"] * 100)
        )
        # we add the sentiment classification to the json response
        json_response["SENTIMENT_CLASSIFICATION_JOB"] = {}
        json_response["SENTIMENT_CLASSIFICATION_JOB"]["categories"] = [
            {"name": sentiment_class, "confidence": sentiment_confidence}
        ]
    # we skip assets with complex html for the sake of clarity
    if len(tagtog_annotation["annotatable"]["parts"]) > 1:
        json_response_array.append(json_response)
        continue
    # Named entities recognition job
    json_response["NAMED_ENTITIES_RECOGNITION_JOB"] = {"annotations": []}
    for entity in tagtog_annotation["entities"]:
        ner_class = tagtog_class_id_to_class_name[entity["classId"]]
        # make sure the class is in the json interface
        assert (
            ner_class
            in json_interface["jobs"]["NAMED_ENTITIES_RECOGNITION_JOB"]["content"]["categories"]
        ), ner_class
        confidence = int(round(entity["confidence"]["prob"] * 100))
        # get the begin offset and the content of the entity
        assert len(entity["offsets"]) == 1, entity["offsets"]
        begin_offset = entity["offsets"][0]["start"]
        content = entity["offsets"][0]["text"]
        # we verify that the content and begin_offset match the asset content
        assert content == asset[begin_offset : begin_offset + len(content)], (
            content,
            asset[begin_offset : begin_offset + len(content)],
        )
        entity_annotation = {
            "children": {},
            "beginOffset": begin_offset,
            "categories": [{"name": ner_class, "confidence": confidence}],
            "content": content,
        }
        # a named entity can have tagtog entity labels
        # we convert them to Kili children annotations
        for field_id in entity["fields"]:
            entity_label = tagtog_class_id_to_class_name[field_id]
            job_interface = json_interface["jobs"][entity_label]
            confidence = int(round(entity["fields"][field_id]["confidence"]["prob"] * 100))
            label_class = entity["fields"][field_id]["value"]
            # we convert boolean fields to match our json interface
            if entity_label == "isLocationQuiet":
                label_class = "Quiet" if label_class else "Noisy"
            if entity_label == "isCriticalProblem":
                label_class = "yes" if label_class else "no"
            if entity_label == "HostCharacter" and label_class == "friendly":
                label_class = "friendliness"
            # some category names have spaces that we converted to "_" in our json interface
            label_class = label_class.replace(" ", "_")
            assert label_class in job_interface["content"]["categories"], f"{label_class} {entity}"
            if entity_label not in entity_annotation["children"]:
                entity_annotation["children"][entity_label] = {"categories": []}
            # we add the child annotations to the NER annotation
            entity_annotation["children"][entity_label]["categories"].append(
                {"name": label_class, "confidence": confidence}
            )
        json_response["NAMED_ENTITIES_RECOGNITION_JOB"]["annotations"].append(entity_annotation)
    json_response_array.append(json_response)
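To make the target format concrete, here is a condensed, standalone sketch of the per-entity conversion, applied to the first entity of the sample annotation shown earlier. The helpers to_percent and to_kili_entity are hypothetical names, and the sketch leaves out the assertions and the value remapping done in the loop above.

```python
def to_percent(prob):
    """Convert a tagtog probability in [0, 1] to a Kili confidence in [0, 100]."""
    return int(round(prob * 100))


def to_kili_entity(entity, legend):
    """Convert one tagtog entity (single offset) to a Kili NER annotation dict."""
    offset = entity["offsets"][0]
    children = {}
    for field_id, field in entity["fields"].items():
        job_name = legend[field_id]
        children.setdefault(job_name, {"categories": []})["categories"].append(
            {"name": field["value"], "confidence": to_percent(field["confidence"]["prob"])}
        )
    return {
        "children": children,
        "beginOffset": offset["start"],
        "categories": [
            {"name": legend[entity["classId"]], "confidence": to_percent(entity["confidence"]["prob"])}
        ],
        "content": offset["text"],
    }


# Excerpt of the legend and the first sample entity, reduced to the fields used here.
legend = {"e_4": "ApartmentOpinion", "f_23": "Sentiment"}
tagtog_entity = {
    "classId": "e_4",
    "confidence": {"prob": 1},
    "fields": {"f_23": {"confidence": {"prob": 1}, "value": "positive"}},
    "offsets": [{"start": 0, "text": "The house is beautiful"}],
}
kili_entity = to_kili_entity(tagtog_entity, legend)
print(kili_entity)
```

The tagtog entity label f_23 ends up under children, keyed by the name of the matching child job, while the entity type becomes the category of the NER annotation.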
We can now upload the Kili annotations to our project:
kili.append_labels(
    json_response_array=json_response_array,
    project_id=project_id,
    asset_external_id_array=external_ids,
)
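The call above pairs the n-th json response with the n-th external id, so both lists need the same length and order. The conversion loop skips assets whose anncomplete flag is false without adding a response, which would shift the pairing if any such asset exists in the export. Here is a minimal sketch of one way to keep the two lists aligned. The function pair_responses and its convert callback are hypothetical names, not part of the Kili SDK, and the data is a toy example.

```python
def pair_responses(external_ids, annotations, convert):
    """Return (ids, responses) of equal length, skipping incomplete annotations.

    `annotations` maps an external id to its tagtog annotation dict, and
    `convert` turns one annotation into a Kili json response.
    """
    kept_ids, responses = [], []
    for external_id in external_ids:
        annotation = annotations[external_id]
        if not annotation["anncomplete"]:
            continue  # drop the id together with its (missing) response
        kept_ids.append(external_id)
        responses.append(convert(annotation))
    return kept_ids, responses


# Toy data: the second asset is not fully annotated.
annotations = {
    "doc-a": {"anncomplete": True, "metas": {"m_18": {"value": "negative"}}},
    "doc-b": {"anncomplete": False, "metas": {}},
    "doc-c": {"anncomplete": True, "metas": {"m_18": {"value": "positive"}}},
}
ids, responses = pair_responses(
    ["doc-a", "doc-b", "doc-c"],
    annotations,
    convert=lambda annotation: {"sentiment": annotation["metas"]["m_18"]["value"]},
)
print(ids)
print(responses)
```

With this approach, the filtered ids would be passed as asset_external_id_array in place of external_ids.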
We can see that an asset's labels in the tagtog labeling interface match the Kili labels we converted previously.
The entity labels (child annotations in Kili) are also shown in the Kili labeling interface.
Conclusion
In this recipe, we saw how to convert a tagtog project to Kili.
As shown above, the Kili json interface and json response formats are flexible enough to represent entity types, nested entity labels and document-level classification in a single project.
Project cleanup
kili.delete_project(project_id)