Effortless Machine Learning Infrastructure: Deploy Kubeflow on Oracle OCI with Terraform

Author: Philip Godfrey

In today's data-driven world, running machine learning workloads efficiently and at scale is crucial for organizations seeking to leverage the power of AI.

The combination of Kubeflow, an open-source Machine Learning (ML) platform, and Oracle Cloud Infrastructure (OCI), a robust cloud computing solution, offers a seamless environment for ML practitioners.

In this blog, I’ll guide you through everything I’ve learned while getting Kubeflow provisioned in OCI, so that others in the Oracle community don’t have to face the same pitfalls I worked through and can set up a scalable, robust ML infrastructure with minimal effort.

Please note, this guide has been produced for use in a development environment only - for a Production environment build, additional steps may be required.

This guide follows the process outlined in the Oracle quickstart guide on GitHub, with some minor code changes in the background, as well as a really useful blog on the same topic.

1. Create a Compartment

Identity & Security -> Compartments -> Create Compartment

Provide a Name and Description and click “Create Compartment”.

A Compartment will then be provisioned. At this point, take a note of the Compartment OCID, as we’ll need it in the next step.
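
If you prefer the command line, the same compartment can be created with the OCI CLI. This is just an optional sketch: the compartment name matches the example used later in this post (OKEKubeflowComp), the description is a placeholder, and the parent compartment here is the tenancy (root) OCID.

============================================

# Create the compartment under the tenancy root (substitute your own tenancy OCID)
oci iam compartment create \
  --compartment-id <tenancy_ocid> \
  --name OKEKubeflowComp \
  --description "Compartment for the OKE Kubeflow deployment"

============================================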

 

 

2. Create a Dynamic Group

Identity & Security -> Dynamic Groups -> Create Dynamic Group

Provide a Name and Description for the Dynamic Group, and within Matching Rules click on ‘Rule Builder’.

In the dropdown, select “Compartment OCID” and paste in the OCID taken from Step 1.   

Now click “Add Rule” and the rule will be assigned.

Click “Create” and the Dynamic Group will be provisioned.
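
As an optional alternative to the console, below is a sketch of the same Dynamic Group created with the OCI CLI. The group name matches the example used later (OKEKubeflowDG), and the matching rule mirrors what the Rule Builder generates when you select “Compartment OCID” - substitute the OCID from Step 1.

============================================

# Create a dynamic group whose members are instances in the Kubeflow compartment
oci iam dynamic-group create \
  --name OKEKubeflowDG \
  --description "Dynamic group for the OKE Kubeflow deployment" \
  --matching-rule "ALL {instance.compartment.id = '<compartment_ocid_from_step_1>'}"

============================================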

 

3. Create a Policy (for the Dynamic Group)

Identity & Security -> Policies -> Create Policy

Under the root compartment, create a policy for the Dynamic Group created in Step 2. As an example:

Using the Dynamic Group: OKEKubeflowDG

Using the Compartment: OKEKubeflowComp
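
The policy statement itself will look something like the sketch below. This broad “manage all-resources” statement is a development-only example using the names above; the exact verbs and resource types you need depend on the quickstart version you’re using, so tighten it for anything beyond a sandbox.

============================================

Allow dynamic-group OKEKubeflowDG to manage all-resources in compartment OKEKubeflowComp

============================================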


4. Terraform Stack / Parameters

After cloning the Git repository, we need to make some amendments to the `terraform.tfvars` template to make it suitable for provisioning Kubeflow within OCI.
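
For reference, cloning the repository looks something like the commands below. I’m assuming here that you’re working from the oke-kubeflow repository under the oracle-quickstart GitHub organisation, so substitute the URL if you’re using a fork.

============================================

# Clone the quickstart repository and move into it
git clone https://github.com/oracle-quickstart/oke-kubeflow.git
cd oke-kubeflow

============================================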

 

Starting from the default `terraform.tfvars` template, I’ve made some amendments, shown in the code chunk below. There are a few fields to amend here, including:

      · compartment_ocid (see Step 1)

      · user_ocid (User Settings)

      · fingerprint (User Settings -> API Keys -> Fingerprint)

      · ssh_provided_public_key (a public SSH key; requires prior creation - see the ssh-keygen sketch below)

terraform.tfvars (Code Chunk)

============================================

kubeflow_node_pool_size = 3
bastion_shape = "VM.Standard.E3.Flex"
kubeflow_node_pool_shape = "VM.Standard.E3.Flex"
user_ocid = "change_me"
fingerprint = "change_me"
pass_phrase = "change_me"        # optional
private_key_path = "change_me"   # location of private key
compartment_ocid = "change_me"
# public_ssh_key
ssh_provided_public_key = "change_me"

============================================

The SSH public key will allow us to SSH onto the Bastion Host to make any amendments within the Kubeflow back end.
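
If you don’t already have a key pair, a minimal sketch of creating one with ssh-keygen is below. The file path is just an example; it’s the contents of the .pub file that go into ssh_provided_public_key.

============================================

# Generate a new key pair for the bastion
ssh-keygen -t rsa -b 4096 -f ~/.ssh/oke_kubeflow -C "oke-kubeflow bastion"

# Paste the output of this into ssh_provided_public_key
cat ~/.ssh/oke_kubeflow.pub

============================================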

In a production environment, or any environment with access to sensitive data, ensure any API key has a pass phrase attached.
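
As a rough sketch based on the standard OCI API key workflow (double-check against the current OCI documentation), you can generate a passphrase-protected API signing key and its fingerprint with openssl:

============================================

# Generate a passphrase-protected private key (you'll be prompted for the passphrase)
openssl genrsa -out ~/.oci/oci_api_key.pem -aes128 2048

# Derive the public key to upload under User Settings -> API Keys
openssl rsa -pubout -in ~/.oci/oci_api_key.pem -out ~/.oci/oci_api_key_public.pem

# Compute the fingerprint to place in terraform.tfvars
openssl rsa -pubout -outform DER -in ~/.oci/oci_api_key.pem | openssl md5 -c

============================================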

  

5. Import the Terraform Stack – zip file or directory into ORM

You are now ready to run the Terraform script to provision and manage several resources, including Kubeflow.

To do this, you need to create a Stack.

Developer Services -> Resource Manager -> Stacks -> Create Stack

 

There are multiple options here: you could zip the folder up and upload that, or simply drag and drop the entire folder you’ve cloned from the oke-kubeflow Git repository, complete with the configuration file we amended in Step 4.
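
If you go down the zip route, a simple sketch is below; the folder name assumes you cloned the repository as oke-kubeflow in Step 4.

============================================

# Zip the cloned folder (including the amended terraform.tfvars) for upload to Resource Manager
zip -r oke-kubeflow.zip oke-kubeflow/

============================================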

 

After uploading the zip, notice it’s recognised that the stack information relates to Kubeflow on Oracle Container Engine for Kubernetes (OKE).

You will notice a lot of the fields are auto-populated for us; these are picked up from the configuration file we amended in the previous step.

There are some optional parameters we can change here, including Kubeflow Password.

 

By default, this is set to OraTest54321, which would not be suitable for Production. This would be a good time to change it to a more secure password.

Once the configuration has been set, the final stage is to Review the Stack. Make any amendments you need, then click Create, and the Stack will be provisioned.

 

6. Run Plan for the stack

Now that the Stack has been provisioned, you will see there are a few options available to us; we want to focus on Plan and Apply.

 

Plan: You can run a Plan job against your stack, which parses your configuration and creates an execution plan. This can be reviewed to make sure you’re happy with what is going to be created.

Apply: Once the Plan has completed, you can run an Apply job against your stack, which provisions your resources. The apply job follows the execution plan, which is based on your Terraform configuration.

The Plan will likely take a couple of minutes to return, but once it’s complete you will be presented with a log file.

Once you have checked through the log to make sure there are no errors, you can move on to running the Apply.
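
If you’d rather drive this from the command line than the console, Resource Manager jobs can also be created with the OCI CLI. This is an optional sketch and the OCIDs are placeholders.

============================================

# Create a Plan job against the stack
oci resource-manager job create-plan-job --stack-id <stack_ocid>

# Once you're happy with the plan, create an Apply job from it
oci resource-manager job create-apply-job \
  --stack-id <stack_ocid> \
  --execution-plan-strategy FROM_PLAN_JOB_ID \
  --execution-plan-job-id <plan_job_ocid>

============================================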

 

7. Run Apply for the stack

The Apply job provisions the resources outlined in the Plan; as you can imagine, this creates several resources in the background (Network, Bastion, the OKE cluster and node pool, Kubeflow, etc.).

Expect this step to take some time to complete - around 15-20 minutes. Once it finishes, check that the job has completed successfully. If any element of the job has failed, you will need to run Destroy before retrying.

A word of caution: while Terraform offers powerful capabilities for infrastructure as code, it's essential to approach its usage with care, employing rigorous testing, version control, and access controls to mitigate the risk of inadvertently destroying your cloud environment.

NOTE: At the end of the log, you will be presented with a BASTION_PUBLIC_IP; make a note of it, as you will use it to connect to Kubeflow.
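
If you need to get onto the Bastion Host itself (for example, to make the back-end amendments mentioned in Step 4), a rough sketch of the SSH command is below. I’m assuming the bastion image uses the default opc user and the key pair created earlier.

============================================

# Connect to the bastion with the private half of the key from Step 4
ssh -i ~/.ssh/oke_kubeflow opc@<BASTION_PUBLIC_IP>

============================================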

Navigating to the BASTION_PUBLIC_IP in your browser should present you with the Kubeflow login page.

Enter your login information and the Kubeflow dashboard will be available. If you have made no alterations, the default user credentials are:

Username:   user@example.com

Password:    OraTest54321

 

There you have it! Kubeflow is ready to use 😊

If you’d be interested in another blog providing information on creating users in Kubeflow, please let me know in the comments.
