Train an object detection model with Vertex AI AutoML and Kili for faster annotation
What is Google Vertex AI?
Vertex AI is a comprehensive machine learning platform designed for the training, deployment, and customization of ML models and AI applications, including large language models (LLMs). It seamlessly integrates with Kili to create a sophisticated model-in-the-loop data annotation workflow.
For more information on Google Vertex AI AutoML, you can visit Introduction to Vertex AI and the AutoML beginner's guide.
Tutorial Objectives
In this tutorial, we will demonstrate how to train an object detection model with annotated data on Google Vertex AI AutoML and utilize it to make predictions on unlabeled data to speed up the annotation process. The tutorial will cover the following steps:
1. Fetching and preparing data from Kili
2. Training a model with Vertex AI
3. Running predictions on unlabeled images from your Kili dataset
4. Uploading predictions to our Kili project for faster annotation
All steps can be performed through the Google Cloud Console UI, but this tutorial performs them exclusively through the Vertex AI SDK for Python (the google-cloud-aiplatform package) and the Kili Python SDK to provide a deeper integration.
For this tutorial, we will use the BCCD dataset (Blood Cell Count and Detection), which is accessible on GitHub and also hosted on Hugging Face or Roboflow. It consists of 3 highly unbalanced classes: platelets, RBCs (red blood cells), and WBCs (white blood cells).
Installation Requirements
!pip install -U google-cloud-aiplatform
!pip install kili
import json
import mimetypes
import random
from pathlib import Path
from typing import List, Union
import requests
from google.cloud import aiplatform, storage
from google.cloud.aiplatform.gapic.schema import trainingjob
from google.colab import auth
from PIL import Image
from tqdm import tqdm
from kili.client import Kili
We first initialize the Kili client.
The API key can be found in the settings of your project, in the "interface" section. The object detection job name of your project can be found in the interface of your project.
kili_api_key = "[KILI API KEY]"
project_id = "[KILI PROJECT ID]"
OBJECT_DETECTION_JOB_NAME = "[OBJECT DETECTION JOB NAME]" # e.g: OBJECT_DETECTION_JOB
kili = Kili(api_key=kili_api_key)
# Authenticate with Google Cloud
auth.authenticate_user()
# Initialize the GCS client
storage_client = storage.Client()
project_name = "[GCP PROJECT NAME]"
location = "[GCP PROJECT LOCATION]"
bucket_name = "[GCS BUCKET NAME]"
dataset_name = "blood_cell" # name of the dataset you will create in Vertex AI
bucket_dataset_dir = (
    f"experiments/{dataset_name}"  # name of the folder in your bucket where files will be stored
)
aiplatform.init(project=project_name, location=location)
Prepare the Data
To train our Vertex AI AutoML model, we need to prepare the data from our Kili project. This section follows the guidelines provided in the Vertex AI data preparation documentation.
This section covers the following steps:
- Downloading annotated images from Kili to your local machine along with their labels.
- Uploading the images to Google Cloud Storage.
- Splitting the annotated data into train/validation/test sets.
- Converting the images and labels into the required format for Vertex AI datasets.
- Uploading the converted input data to Google Cloud Storage.
In this particular example, we have annotated 150 assets on the Kili app.
Retrieving and downloading labeled assets from Kili
We first call the assets function of the Kili Python SDK to retrieve the assets.
The download_media argument downloads the media (images here) into the folder given by the local_media_dir argument. When it is set, the content field is automatically replaced by the local path of the downloaded asset.
For each asset, we query its id, its externalId, and the jsonResponse field of its latest label (the last one submitted on Kili). For more information on the assets function or on the other fields that you can query, see the function documentation.
assets = kili.assets(
    project_id=project_id,
    download_media=True,
    local_media_dir="./images",
    status_in=["LABELED"],
    fields=["latestLabel.jsonResponse", "content", "id", "externalId"],
    disable_tqdm=False,
)
# Plot an example image
Image.open(assets[0]["content"])
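Because the BCCD classes are highly unbalanced, it can be worth checking how many boxes of each class the downloaded labels contain before training. The following is a minimal sketch, assuming each asset carries the latestLabel.jsonResponse structure queried above. The helper count_annotations_per_class and the sample data are hypothetical, introduced only for illustration.

```python
from collections import Counter
from typing import Dict, List


def count_annotations_per_class(assets: List[dict], job_name: str) -> Dict[str, int]:
    """Count bounding boxes per category across the latest labels of the assets."""
    counts: Counter = Counter()
    for asset in assets:
        json_response = asset["latestLabel"]["jsonResponse"]
        # An asset may have no annotation for this job, hence the defaults.
        for annotation in json_response.get(job_name, {}).get("annotations", []):
            counts[annotation["categories"][0]["name"]] += 1
    return dict(counts)


# Hypothetical assets mimicking the structure queried from Kili
sample_assets = [
    {"latestLabel": {"jsonResponse": {"OBJECT_DETECTION_JOB": {"annotations": [
        {"categories": [{"name": "RBC"}]},
        {"categories": [{"name": "RBC"}]},
        {"categories": [{"name": "WBC"}]},
    ]}}}},
    {"latestLabel": {"jsonResponse": {"OBJECT_DETECTION_JOB": {"annotations": [
        {"categories": [{"name": "PLATELETS"}]},
    ]}}}},
]
class_counts = count_annotations_per_class(sample_assets, "OBJECT_DETECTION_JOB")
print(class_counts)
```

On the real project, the same helper could be called with assets and OBJECT_DETECTION_JOB_NAME.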
Uploading images to Google Cloud Storage
When importing data into a Vertex AI dataset, the images must already be stored in a Google Cloud Storage bucket:
def upload_assets_to_bucket(assets: List[dict], bucket_name: str, bucket_dataset_dir: str):
    bucket = storage_client.get_bucket(bucket_name)
    for asset in tqdm(assets, desc="uploading assets to bucket"):
        image_bucket_path = f"{bucket_dataset_dir}/images/{Path(asset['content']).name}"
        image_local_path = asset["content"]
        blob = bucket.blob(image_bucket_path)
        blob.upload_from_filename(image_local_path)

upload_assets_to_bucket(assets, bucket_name, bucket_dataset_dir)
Splitting the Dataset into Train/Validation/Test Sets
We will divide our annotated images using the following proportions:
- Training set: 70%
- Validation set: 20%
- Test set: 10%
def split_assets_in_train_val_test(assets: List[dict]):
    # shuffle the assets (in place) before splitting
    random.shuffle(assets)
    # Calculate the lengths of each split
    total_len = len(assets)
    train_len = int(0.7 * total_len)
    val_len = int(0.2 * total_len)
    # Split the list into train, validation, and test
    train_assets = assets[:train_len]
    val_assets = assets[train_len : train_len + val_len]
    test_assets = assets[train_len + val_len :]
    return train_assets, val_assets, test_assets

train_assets, val_assets, test_assets = split_assets_in_train_val_test(assets)
print(len(train_assets), len(val_assets), len(test_assets))
105 30 15
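Note that random.shuffle reorders the list in place and gives a different split on every run. If you need a reproducible split, one option is sketched below. It assumes a fixed seed is acceptable for your experiments, and split_with_seed is a hypothetical helper, not part of either SDK.

```python
import random
from typing import List, Sequence, Tuple


def split_with_seed(
    items: Sequence, train_ratio: float = 0.7, val_ratio: float = 0.2, seed: int = 42
) -> Tuple[List, List, List]:
    """Shuffle a copy of items with a fixed seed and split it into train/val/test."""
    shuffled = list(items)  # copy, so the caller's list keeps its order
    random.Random(seed).shuffle(shuffled)
    train_len = int(train_ratio * len(shuffled))
    val_len = int(val_ratio * len(shuffled))
    return (
        shuffled[:train_len],
        shuffled[train_len : train_len + val_len],
        shuffled[train_len + val_len :],
    )


# 150 placeholder items stand in for the 150 annotated assets
train, val, test = split_with_seed(range(150))
print(len(train), len(val), len(test))
```

Calling the helper twice with the same seed returns the same three lists, which makes training runs easier to compare.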
Converting images and labels into Vertex AI required format
Data that we upload to the Vertex AI dataset must follow a predefined schema. The schema is provided in a YAML file format that is based on the OpenAPI format.
In the following function, we generate JSON data that adheres to the required schema for an asset. This includes the location of the asset on Google Cloud Storage, the bounding boxes with their coordinates and categories, as well as the asset split.
def get_asset_io_input(asset: dict, split: str, bucket_name: str, image_bucket_path: str):
    image_gcs_uri = f"gs://{bucket_name}/{image_bucket_path}"
    labels = []
    for annotation in asset["latestLabel"]["jsonResponse"][OBJECT_DETECTION_JOB_NAME][
        "annotations"
    ]:
        normalized_vertices = annotation["boundingPoly"][0]["normalizedVertices"]
        # reduce the corner points of the rectangle to its min/max coordinates
        label = {
            "displayName": annotation["categories"][0]["name"],
            "xMin": min(vertex["x"] for vertex in normalized_vertices),
            "yMin": min(vertex["y"] for vertex in normalized_vertices),
            "xMax": max(vertex["x"] for vertex in normalized_vertices),
            "yMax": max(vertex["y"] for vertex in normalized_vertices),
        }
        labels.append(label)
    return {
        "imageGcsUri": image_gcs_uri,
        "boundingBoxAnnotations": labels,
        "dataItemResourceLabels": {"aiplatform.googleapis.com/ml_use": split},
    }
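To make the conversion concrete, here is a small sketch of the bounding-box step on its own. It assumes that the Kili annotation stores a rectangle as four normalizedVertices with x and y in [0, 1], as read by the function above. The helper name and the sample values are made up for illustration.

```python
from typing import Dict, List


def vertices_to_vertex_ai_box(normalized_vertices: List[Dict[str, float]]) -> Dict[str, float]:
    """Reduce the corner points of a rectangle to the xMin/yMin/xMax/yMax form."""
    xs = [vertex["x"] for vertex in normalized_vertices]
    ys = [vertex["y"] for vertex in normalized_vertices]
    return {"xMin": min(xs), "yMin": min(ys), "xMax": max(xs), "yMax": max(ys)}


# Hypothetical rectangle; the order of the corners does not matter
sample_vertices = [
    {"x": 0.25, "y": 0.75},
    {"x": 0.25, "y": 0.5},
    {"x": 0.5, "y": 0.5},
    {"x": 0.5, "y": 0.75},
]
box = vertices_to_vertex_ai_box(sample_vertices)
print(box)
```

Taking the minimum and maximum of each coordinate keeps the conversion independent of the order in which the corners are stored.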
We will generate a single jsonl file storing the JSON input of every image. Each line of the jsonl file corresponds to one data item to import, in the format defined above.
This jsonl file then needs to be uploaded to the Google Cloud Storage bucket and will be the input given when creating the dataset.
def generate_and_upload_inputs_to_bucket(
    train_assets: List[dict],
    val_assets: List[dict],
    test_assets: List[dict],
    bucket_name: str,
    bucket_dataset_dir: str,
):
    output_jsonl_file = "inputs.jsonl"
    bucket = storage_client.get_bucket(bucket_name)
    with open(output_jsonl_file, "w") as output_file:
        for split_name, split_assets in [
            ("training", train_assets),
            ("validation", val_assets),
            ("test", test_assets),
        ]:
            for asset in split_assets:
                # get the input object to be sent when importing the dataset
                image_bucket_path = f"{bucket_dataset_dir}/images/{Path(asset['content']).name}"
                asset_input = get_asset_io_input(asset, split_name, bucket_name, image_bucket_path)
                # add the example input to the jsonl file
                json_line = json.dumps(asset_input)
                output_file.write(json_line + "\n")
    # upload the inputs file to the bucket
    blob = bucket.blob(f"{bucket_dataset_dir}/inputs.jsonl")
    blob.upload_from_filename(output_jsonl_file)

generate_and_upload_inputs_to_bucket(
    train_assets, val_assets, test_assets, bucket_name, bucket_dataset_dir
)
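Before creating the dataset, it can help to sanity-check the generated file locally, since a malformed line only surfaces later during the import. The sketch below is one possible check, not an official validator. It assumes the field names produced by get_asset_io_input above, and check_import_line is a hypothetical helper.

```python
import json
from typing import List


def check_import_line(line: str) -> List[str]:
    """Return a list of problems found in one line of the import JSONL file."""
    problems = []
    item = json.loads(line)
    if not item.get("imageGcsUri", "").startswith("gs://"):
        problems.append("imageGcsUri is not a gs:// URI")
    for box in item.get("boundingBoxAnnotations", []):
        coords = [box[key] for key in ("xMin", "yMin", "xMax", "yMax")]
        if not all(0.0 <= value <= 1.0 for value in coords):
            problems.append(f"coordinates out of [0, 1] for {box['displayName']}")
        if box["xMin"] >= box["xMax"] or box["yMin"] >= box["yMax"]:
            problems.append(f"empty or inverted box for {box['displayName']}")
    return problems


# Hypothetical lines: one valid, one with an out-of-range coordinate
good_line = json.dumps({
    "imageGcsUri": "gs://my-bucket/experiments/blood_cell/images/a.jpg",
    "boundingBoxAnnotations": [
        {"displayName": "RBC", "xMin": 0.1, "yMin": 0.2, "xMax": 0.3, "yMax": 0.4}
    ],
})
bad_line = json.dumps({
    "imageGcsUri": "gs://my-bucket/experiments/blood_cell/images/b.jpg",
    "boundingBoxAnnotations": [
        {"displayName": "WBC", "xMin": 0.5, "yMin": 0.2, "xMax": 1.2, "yMax": 0.4}
    ],
})
print(check_import_line(good_line))
print(check_import_line(bad_line))
```

To check the real file, the helper can be applied to each line of inputs.jsonl, stopping at the first line that reports a problem.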
Create a dataset in Vertex
To train an AutoML model, we need a Vertex AI dataset.
When creating this dataset, we also provide the jsonl file generated in the previous section to import the images with their labels into the dataset.
For more information on dataset creation, see the Vertex AI documentation.
def create_and_import_dataset_image_sample(
    project: str,
    location: str,
    display_name: str,
    src_uris: Union[str, List[str]],
    import_schema_uri: str,
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)
    ds = aiplatform.ImageDataset.create(
        display_name=display_name,
        gcs_source=src_uris,
        import_schema_uri=import_schema_uri,
        sync=sync,
    )
    ds.wait()
    print(ds.display_name)
    print(ds.resource_name)
    return ds

inputs_uri = f"gs://{bucket_name}/{bucket_dataset_dir}/inputs.jsonl"
import_schema_uri = (
    "gs://google-cloud-aiplatform/schema/dataset/ioformat/image_bounding_box_io_format_1.0.0.yaml"
)
ds = create_and_import_dataset_image_sample(
    project_name, location, dataset_name, inputs_uri, import_schema_uri
)
INFO:google.cloud.aiplatform.datasets.dataset:Creating ImageDataset
INFO:google.cloud.aiplatform.datasets.dataset:Create ImageDataset backing LRO: projects/**********/locations/europe-west4/datasets/2314474175491735552/operations/7318488207820062720
INFO:google.cloud.aiplatform.datasets.dataset:ImageDataset created. Resource name: projects/**********/locations/europe-west4/datasets/2314474175491735552
INFO:google.cloud.aiplatform.datasets.dataset:To use this ImageDataset in another session:
INFO:google.cloud.aiplatform.datasets.dataset:ds = aiplatform.ImageDataset('projects/**********/locations/europe-west4/datasets/2314474175491735552')
INFO:google.cloud.aiplatform.datasets.dataset:Importing ImageDataset data: projects/**********/locations/europe-west4/datasets/2314474175491735552
INFO:google.cloud.aiplatform.datasets.dataset:Import ImageDataset data backing LRO: projects/**********/locations/europe-west4/datasets/2314474175491735552/operations/7021250632413609984
INFO:google.cloud.aiplatform.datasets.dataset:ImageDataset data imported. Resource name: projects/**********/locations/europe-west4/datasets/2314474175491735552
blood_cell
projects/**********/locations/europe-west4/datasets/2314474175491735552
dataset_id = ds.name.split("/")[-1]
print(dataset_id)
2314474175491735552
Train a Model
We are now ready to train our model!
When creating the training pipeline, we provide the task definition schema (object detection here) and the filter_split filters, which tell the AutoML training algorithm which split each image was assigned to when we imported the dataset.
We also provide some task inputs, such as the model type and a time budget.
We use the default model type, CLOUD_HIGH_ACCURACY_1, which is expected to have a higher latency but also a higher prediction quality than the other model types.
The time budget is the maximum amount of training we are prepared to pay for. If the model converges before reaching this budget, training stops early. We have set the time budget to the minimum for initial experiments, but you are free to increase it when you move from experimentation to production. For a detailed explanation of AutoML costs, see the Vertex AI Pricing page.
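The budget is expressed in milli node hours, so 1,000 units correspond to one node hour. As a quick illustration of the arithmetic for the value of 20000 used in this tutorial (the helper name is ours, not part of the SDK):

```python
def milli_node_hours_to_node_hours(budget_milli_node_hours: int) -> float:
    """Convert a training budget from milli node hours to node hours."""
    return budget_milli_node_hours / 1000


node_hours = milli_node_hours_to_node_hours(20000)
print(node_hours)
```

A budget of 20000 therefore allows up to 20 node hours of training.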
def create_training_pipeline_image_object_detection_sample(
    project: str, display_name: str, dataset_id: str, model_display_name: str, location: str
):
    client_options = {"api_endpoint": f"{location}-aiplatform.googleapis.com"}
    client = aiplatform.gapic.PipelineServiceClient(client_options=client_options)
    training_task_inputs = trainingjob.definition.AutoMlImageObjectDetectionInputs(
        model_type="CLOUD_HIGH_ACCURACY_1",
        budget_milli_node_hours=20000,  # The minimum time budget
        disable_early_stopping=False,
    ).to_value()
    training_pipeline = {
        "display_name": display_name,
        "training_task_definition": "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_image_object_detection_1.0.0.yaml",
        "training_task_inputs": training_task_inputs,
        "input_data_config": {
            "dataset_id": dataset_id,
            "filter_split": {
                "training_filter": "labels.aiplatform.googleapis.com/ml_use=training",
                "validation_filter": "labels.aiplatform.googleapis.com/ml_use=validation",
                "test_filter": "labels.aiplatform.googleapis.com/ml_use=test",
            },
        },
        "model_to_upload": {"display_name": model_display_name},
    }
    parent = f"projects/{project}/locations/{location}"
    response = client.create_training_pipeline(parent=parent, training_pipeline=training_pipeline)
    print("response:", response)
    return response
The code above launches an asynchronous training pipeline.
Once launched, the pipeline can be followed at this address: https://console.cloud.google.com/vertex-ai/training/training-pipelines
training_name = "autoML-training-blood_cell-poc-v2"
model_display_name = "blood_cell-poc-v2"
response = create_training_pipeline_image_object_detection_sample(
    project_name, training_name, dataset_id, model_display_name, location
)
response: name: "projects/**********/locations/europe-west4/trainingPipelines/5236749032969207808"
display_name: "autoML-training-blood_cell-poc-v2"
input_data_config {
dataset_id: "2314474175491735552"
filter_split {
training_filter: "labels.aiplatform.googleapis.com/ml_use=training"
validation_filter: "labels.aiplatform.googleapis.com/ml_use=validation"
test_filter: "labels.aiplatform.googleapis.com/ml_use=test"
}
}
training_task_definition: "gs://google-cloud-aiplatform/schema/trainingjob/definition/automl_image_object_detection_1.0.0.yaml"
training_task_inputs {
struct_value {
fields {
key: "budgetMilliNodeHours"
value {
string_value: "20000"
}
}
fields {
key: "modelType"
value {
string_value: "CLOUD_HIGH_ACCURACY_1"
}
}
}
}
model_to_upload {
display_name: "blood_cell-poc-v2"
}
state: PIPELINE_STATE_PENDING
create_time {
seconds: 1698654485
nanos: 864404000
}
update_time {
seconds: 1698654485
nanos: 864404000
}
Once trained, your model should appear at this address: https://console.cloud.google.com/vertex-ai/models
You can now evaluate it, deploy it, create predictions, etc.
To proceed with additional operations, it's necessary to obtain the model's unique identifier (ID) or resource name, which can be found in the model's information section. This identifier is readily accessible through the Google Cloud Console interface. While it is possible to fetch the ID programmatically using the Python SDK for a more comprehensive integration, for the purposes of this tutorial, we will streamline the process by directly retrieving the ID from the console's user interface.
# MODEL ID to be found on the model registry of Vertex AI: https://console.cloud.google.com/vertex-ai/models
model_resource_name = "[MODEL ID]"
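If you would rather fetch the identifier programmatically, one possible approach is to list the models that carry the display name given at training time and keep the most recent one. This is a sketch under assumptions: it relies on aiplatform.Model.list accepting a filter on display_name, and find_model_resource_name is a hypothetical helper. The listing function is injectable so that the selection logic can be tried without GCP credentials.

```python
from typing import Callable, Optional, Sequence


def find_model_resource_name(
    display_name: str,
    list_models: Optional[Callable[..., Sequence]] = None,
) -> str:
    """Return the resource name of the most recently created model with this display name."""
    if list_models is None:
        # Imported lazily so the helper can be exercised without GCP credentials.
        from google.cloud import aiplatform

        list_models = aiplatform.Model.list
    models = list_models(filter=f'display_name="{display_name}"')
    if not models:
        raise ValueError(f"No model found with display name {display_name!r}")
    # Several models can share a display name, so keep the latest one.
    latest = max(models, key=lambda model: model.create_time)
    return latest.resource_name
```

After aiplatform.init has been called, find_model_resource_name(model_display_name) could then replace the value copied from the console.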
Batch inference prediction
Now that our model is trained, we can download unlabeled data from Kili, upload it to Cloud Storage, and prepare it in the input format required for batch prediction, as defined in the Vertex AI documentation.
# Retrieve and download unlabeled assets from Kili
unlabeled_assets = kili.assets(
    project_id=project_id,
    download_media=True,
    local_media_dir="./images",
    status_in=["TODO"],
    fields=["content", "id", "externalId"],
    disable_tqdm=False,
)
upload_assets_to_bucket(unlabeled_assets, bucket_name, bucket_dataset_dir)
def upload_test_source_to_bucket(assets: List[dict], bucket_name: str, bucket_dataset_dir: str):
    output_jsonl_file = "batch_inference_inputs.jsonl"
    bucket = storage_client.get_bucket(bucket_name)
    with open(output_jsonl_file, "w") as output_file:
        for asset in assets:
            # build the input object describing one image for batch prediction;
            # "content" must be the full Cloud Storage URI of the image
            image_bucket_path = f"{bucket_dataset_dir}/images/{Path(asset['content']).name}"
            batch_input = {
                "content": f"gs://{bucket_name}/{image_bucket_path}",
                "mimeType": mimetypes.guess_type(asset["content"])[0],
            }
            # add the test input to the jsonl file
            json_line = json.dumps(batch_input)
            output_file.write(json_line + "\n")
    # upload the inputs file to the bucket
    blob = bucket.blob(f"{bucket_dataset_dir}/batch_inference_inputs.jsonl")
    blob.upload_from_filename(output_jsonl_file)

upload_test_source_to_bucket(unlabeled_assets, bucket_name, bucket_dataset_dir)
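For reference, each line of the batch input file is a small JSON object holding the Cloud Storage URI of an image and its MIME type. The sketch below builds one such line. The bucket and file names are placeholders, and build_batch_input_line is a hypothetical helper.

```python
import json
import mimetypes


def build_batch_input_line(bucket_name: str, image_bucket_path: str) -> str:
    """Build one JSONL line describing an image for batch prediction."""
    mime_type, _ = mimetypes.guess_type(image_bucket_path)
    if mime_type is None:
        raise ValueError(f"Cannot guess the MIME type of {image_bucket_path}")
    return json.dumps(
        {"content": f"gs://{bucket_name}/{image_bucket_path}", "mimeType": mime_type}
    )


# Placeholder bucket and path, for illustration only
line = build_batch_input_line("my-bucket", "experiments/blood_cell/images/cell_001.jpg")
print(line)
```

Raising an error when the MIME type cannot be guessed avoids writing a null mimeType into the input file.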
We now create a batch prediction job and wait for its result.
def create_batch_prediction_job_sample(
    project: str,
    location: str,
    model_resource_name: str,
    job_display_name: str,
    gcs_source: str,
    gcs_destination: str,
    sync: bool = True,
):
    aiplatform.init(project=project, location=location)
    my_model = aiplatform.Model(model_resource_name)
    batch_prediction_job = my_model.batch_predict(
        job_display_name=job_display_name,
        gcs_source=gcs_source,
        gcs_destination_prefix=gcs_destination,
        sync=sync,
    )
    batch_prediction_job.wait()
    print(batch_prediction_job.display_name)
    print(batch_prediction_job.resource_name)
    print(batch_prediction_job.state)
    return batch_prediction_job

job_display_name = "batch_prediction_test"
gcs_source = f"gs://{bucket_name}/{bucket_dataset_dir}/batch_inference_inputs.jsonl"
gcs_destination = f"gs://{bucket_name}/{bucket_dataset_dir}/batch_inference"
batch_prediction_job = create_batch_prediction_job_sample(
    project_name, location, model_resource_name, job_display_name, gcs_source, gcs_destination
)
Creating BatchPredictionJob
BatchPredictionJob created. Resource name: projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200
To use this BatchPredictionJob in another session:
bpj = aiplatform.BatchPredictionJob('projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200')
View Batch Prediction Job:
https://console.cloud.google.com/ai/platform/locations/europe-west4/batch-predictions/466893868839731200?project=**********
BatchPredictionJob projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200 current state:
JobState.JOB_STATE_RUNNING
BatchPredictionJob projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200 current state:
JobState.JOB_STATE_SUCCEEDED
BatchPredictionJob run completed. Resource name: projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200
batch_prediction_test
projects/**********/locations/europe-west4/batchPredictionJobs/466893868839731200
JobState.JOB_STATE_SUCCEEDED
The URL of the batch prediction output file can be inferred from the destination folder given to the job and from the output_info attribute of the BatchPredictionJob. Once again, for simplicity, we will look up the output file manually in Google Cloud Storage, in the destination folder given above.
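If you prefer to locate the output programmatically, a possible approach is to list the blobs under the destination prefix and keep the prediction files. This is only a sketch: it assumes that the job writes its results to files whose names start with predictions_ and end with .jsonl, which you should confirm in your own bucket. The helper select_prediction_files and the sample listing are hypothetical.

```python
from typing import Iterable, List


def select_prediction_files(blob_names: Iterable[str]) -> List[str]:
    """Keep the blob names that look like batch prediction result files."""
    selected = []
    for name in blob_names:
        file_name = name.rsplit("/", 1)[-1]
        if file_name.startswith("predictions_") and file_name.endswith(".jsonl"):
            selected.append(name)
    return sorted(selected)


# Hypothetical listing of the destination folder
listing = [
    "experiments/blood_cell/batch_inference/some-job-folder/other_file.jsonl",
    "experiments/blood_cell/batch_inference/some-job-folder/predictions_00001.jsonl",
]
prediction_files = select_prediction_files(listing)
print(prediction_files)
```

The blob names can be obtained from storage_client.list_blobs(bucket_name, prefix=...), reading the name attribute of each returned blob.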
# To be found in the cloud storage at the previously given destination folder
output_gcp_url = "[URL of the prediction output]"
Import predictions to Kili
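jsonl_file_path = "batch_prediction_output.jsonl"
response = requests.get(output_gcp_url)  # the URL must be readable as is, e.g. a signed URL
response.raise_for_status()
Path(jsonl_file_path).write_bytes(response.content)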
Now that the prediction output has been retrieved, we convert it to the Kili format described in the Kili documentation and upload the predictions to your Kili project.
def vertex_to_kili(json_output):
    # the file name of the image in the bucket is used as the external id
    external_id = json_output["instance"]["content"].split("/")[-1]
    annotations = []
    for category_name, bbox, confidence in zip(
        json_output["prediction"]["displayNames"],
        json_output["prediction"]["bboxes"],
        json_output["prediction"]["confidences"],
    ):
        # each bbox is read as [xMin, xMax, yMin, yMax]
        normalized_vertices = [
            {"x": bbox[0], "y": bbox[3]},
            {"x": bbox[0], "y": bbox[2]},
            {"x": bbox[1], "y": bbox[2]},
            {"x": bbox[1], "y": bbox[3]},
        ]
        annotations.append(
            {
                # same structure as the one read from Kili in get_asset_io_input
                "boundingPoly": [{"normalizedVertices": normalized_vertices}],
                "categories": [{"name": category_name.upper(), "confidence": confidence}],
                "type": "rectangle",
            }
        )
    json_response = {OBJECT_DETECTION_JOB_NAME: {"annotations": annotations}}
    return json_response, external_id
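The index juggling above follows from the order of the values in each predicted box. As a sketch, assuming each bbox is [xMin, xMax, yMin, yMax] in normalized coordinates (as the function above reads it), the four corners can be derived as follows. The helper name and the sample values are illustrative.

```python
from typing import Dict, List, Sequence


def bbox_to_corners(bbox: Sequence[float]) -> List[Dict[str, float]]:
    """Turn [xMin, xMax, yMin, yMax] into the four corner points of the rectangle."""
    x_min, x_max, y_min, y_max = bbox
    # In image coordinates y grows downward, so y_max is the bottom edge.
    return [
        {"x": x_min, "y": y_max},  # bottom left
        {"x": x_min, "y": y_min},  # top left
        {"x": x_max, "y": y_min},  # top right
        {"x": x_max, "y": y_max},  # bottom right
    ]


corners = bbox_to_corners([0.1, 0.4, 0.2, 0.6])
print(corners)
```

Unpacking the four values into named variables makes the corner order easier to audit than raw indices.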
json_response_array = []
external_id_array = []
with open(jsonl_file_path) as jsonl_file:
    for line in jsonl_file:
        line = line.strip()
        json_output = json.loads(line)
        json_response, external_id = vertex_to_kili(json_output)
        json_response_array.append(json_response)
        external_id_array.append(external_id)
kili.create_predictions(
    project_id=project_id,
    json_response_array=json_response_array,
    external_id_array=external_id_array,
)
Results and conclusion
You can visualize your predictions in Kili and use them as preannotations for your project.
We trained the model on a small dataset and for a short time, but the results are already satisfying.
In this tutorial, we have demonstrated how Google Vertex AI AutoML and the Kili platform work together for object detection model training and preannotation generation. Combining Vertex AI and Kili gives a workflow that streamlines training and speeds up data annotation.
You are free to take this tutorial as an integration starting point and implement a more complex active learning pipeline.