Continuous performance and scale validation of Red Hat OpenShift AI model-serving stack

in RHOAI, PSAP, Work | blog-post

Today, my blog post on continuous performance and scale testing of the Red Hat OpenShift AI KServe model-serving stack was published!

Great work in collaboration with multiple people from the PSAP team and the RHOAI QE and dev teams.

It presents the results of three different flavors of performance and scale testing. Each flavor focuses on a particular aspect of the KServe model serving stack:

  • Single-model performance testing
    • Focuses on the performance of the model and its serving runtime, to verify that it does not regress across releases.
  • Multi-model performance and scale testing
    • Focuses on the performance of the model serving stack when running under heavy load but at low scale.
  • Many-model scale testing
    • Focuses on the scalability of the model deployment stack when running at large scale.


A Guide to Scaling OpenShift Data Science to Hundreds of Users and Notebooks

in RHOAI, PSAP, Work | blog-post

Today we published a blog post presenting the results of my last 6 months of work: testing Red Hat OpenShift Data Science with 300 users requesting notebooks within 15 minutes.

It took a huge amount of work to get the scale-testing infrastructure in place, but it was fruitful :) Along the way, we highlighted:

  • a network component dealing badly with its frequent reconfiguration (it was randomly throwing 404 errors). It got removed from the architecture.
  • a control plane overload leading to its collapse (and auto-recovery). The component spamming the Kubernetes API Server got refactored to avoid the compute-intensive, aggressive requests.
  • multiple race conditions in the Web UI, appearing under random conditions (including, but not limited to, the system load) and hence hard to observe and reproduce manually. We tracked down the root cause of the issues and got them fixed.

The blog post shows the final result, where OpenShift Data Science and the scale test infrastructure are running happily; there wasn’t enough space to describe the road we took to get there, a pity 😅 🐞

And that’s just the beginning: now that the baseline is defined, we need to bring in more users, in less time, and reduce the time it takes to get a notebook … still a lot of work ahead!

RHODS notebook scale testing

A Guide to Functional and Performance Testing of the NVIDIA DGX A100

in Work | blog-post

My blog post on NVIDIA DGX A100 GPU testing was published yesterday on the OpenShift blog :)

In this blog post, we present how we performed the functional validation of the OpenShift GPU Operator running on the eight GPUs of a DGX™ A100. We describe the different MIG modes we tested, as well as the values of the node labels and Kubernetes resources exposed with these different settings. We also conduct a performance benchmark involving the eight GPUs running simultaneously, either all training a single AI/ML model or all performing independent computations.
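To give an idea of what this inspection looks like, here is a sketch of the node labels and extended resources one might see on a MIG-enabled node. The label names follow NVIDIA GPU Feature Discovery conventions; the values are made-up examples, not measurements from the blog post:

```yaml
# Hypothetical excerpt of "oc describe node" output on a MIG-enabled node.
# Label names follow NVIDIA GPU Feature Discovery; values are illustrative only.
Labels:
  nvidia.com/gpu.product: A100-SXM4-40GB
  nvidia.com/mig.strategy: mixed          # "single" or "mixed" MIG advertisement strategy
Capacity:
  nvidia.com/mig-1g.5gb: "16"             # 1-compute-slice / 5 GB instances (example count)
  nvidia.com/mig-3g.20gb: "8"             # 3-compute-slice / 20 GB instances (example count)
```

With the "single" strategy, all instances would instead be advertised under the generic `nvidia.com/gpu` resource name.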


Entitlement-Free Deployment of the NVIDIA GPU Operator on OpenShift

in PSAP | blog-post

Last night, we published the blog post presenting the work I led during the past few months, in which we removed the need to deploy RHEL entitlement certificates to build and deploy the GPU driver of the NVIDIA GPU Operator. This requirement for a valid RHEL subscription key was a major burden for OpenShift GPU-computing customers, as the key generation and deployment process couldn’t be properly automated.

This work was a great cooperation effort, as it required the enhancement of multiple parts of the overall software stack:

  • first at the team level, with enhancements to the Node Feature Discovery (Eduardo Arango) and the OpenShift Driver Toolkit container image (David Gray, Zvonko Kaiser, and Ashish Kamra),
  • then at the project level, with core OpenShift corner-case bugs discovered and solved, revolving around the Driver Toolkit’s dynamic image-streams,
  • finally at the inter-company Open Source level, with the NVIDIA Cloud Native team (Shiva Krishna Merla) reviewing the PRs, providing valuable feedback, and spotting bugs in the middle of the different rewrites of the logic!

Link to the blog post

Timeline of the project milestones (in 2021):

  • May 27th..June 1st: the idea of using the Driver Toolkit for entitlement-free deployment arose from a Slack discussion about solving disconnected-cluster deployment challenges. We quickly confirmed that, with minor changes, the DTK provides everything required to build the NVIDIA driver.

  • July 30th..August 11th: working POC of the GPU Operator building the driver without entitlements, with no modification of the operator, only a bit of configuration and manually crafted YAML files

  • August 26th..~November 15th: intensive work to add seamless upgrade support to the POC and get it all polished, tested, and merged into the GPU Operator

  • December 2nd: GPU Operator v1.9.0 is released, with entitlement-free deployment enabled by default on OpenShift \o/

It’s funny to see that it took only a couple of days to get the first POC working, while the integration of the seamless upgrade support took two full months of work!

(Seamless upgrade support addresses the fact that, at a given time during a cluster upgrade, different nodes may run different versions of the OS. With one container image for all OS versions, there is no concern: the driver deployment works the whole time. But with one container image per OS version, that’s another story! This is covered in depth in the blog post.)
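To make the per-OS-version point concrete, here is a hypothetical sketch (not the operator’s actual manifests) of how a driver DaemonSet can be pinned to nodes running one specific RHCOS build, using an OS-version node label of the kind Node Feature Discovery publishes. All names, labels, and values below are illustrative assumptions:

```yaml
# Hypothetical sketch: one driver DaemonSet per OS build, so that during a
# cluster upgrade each node runs a driver image built against its own kernel.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: nvidia-driver-rhcos-49.84          # example name, one DaemonSet per RHCOS version
spec:
  selector:
    matchLabels:
      app: nvidia-driver-rhcos-49.84
  template:
    metadata:
      labels:
        app: nvidia-driver-rhcos-49.84
    spec:
      nodeSelector:
        # NFD-published OS-version label; the value is an example
        feature.node.kubernetes.io/system-os_release.OSTREE_VERSION: "49.84.202110270303-0"
      containers:
      - name: nvidia-driver
        image: quay.io/example/nvidia-driver:rhcos-49.84   # hypothetical image built with the matching Driver Toolkit
```

During an upgrade, nodes that have moved to the new OS build simply stop matching the old DaemonSet’s node selector and start matching the new one.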

Using NVIDIA A100’s Multi-Instance GPU to Run Multiple Workloads in Parallel on a Single GPU

in PSAP | blog-post

Today, my work on benchmarking NVIDIA A100 Multi-Instance GPUs running multiple AI/ML workloads in parallel was published on the OpenShift blog:

The new Multi-Instance GPU (MIG) feature lets GPUs based on the NVIDIA Ampere architecture run multiple GPU-accelerated CUDA applications in parallel in a fully isolated way. The compute units of the GPU, as well as its memory, can be partitioned into multiple MIG instances. Each of these instances presents as a stand-alone GPU device from the system perspective and can be bound to any application, container, or virtual machine running on the node. At the hardware level, each MIG instance has its own dedicated resources (compute, cache, memory), so the workload running in one instance does not affect what is running in the other ones.

In collaboration with NVIDIA, we extended the GPU Operator to give OpenShift users the ability to dynamically reconfigure the geometry of the MIG partitioning. The geometry of the MIG partitioning is the way hardware resources are bound to MIG instances, so it directly influences their performance and the number of instances that can be allocated. The A100-40GB, which we used for this benchmark, has eight compute units and 40 GB of RAM. When MIG mode is enabled, the eighth unit is reserved for resource management.
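Once the MIG geometry is configured, each instance is consumed like any other Kubernetes extended resource. As a minimal, hypothetical sketch (assuming the “mixed” MIG strategy, which advertises per-profile resource names such as `nvidia.com/mig-1g.5gb`), a pod requesting one small MIG instance could look like:

```yaml
# Hypothetical pod spec: request one 1-compute-slice / 5 GB MIG instance.
apiVersion: v1
kind: Pod
metadata:
  name: mig-example
spec:
  restartPolicy: Never
  containers:
  - name: cuda-workload
    image: nvidia/cuda:11.0-base        # example CUDA base image
    command: ["nvidia-smi", "-L"]       # lists the single MIG device visible to the pod
    resources:
      limits:
        nvidia.com/mig-1g.5gb: 1        # one MIG instance of this profile
```

From the workload’s perspective, the MIG instance behaves like a dedicated, smaller GPU.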

NVIDIA A100 MIG benchmark

Multi-Instance GPU Support with the GPU Operator v1.7.0

in Work | blog-post

Today, my work on enabling NVIDIA GPU Operator to support the A100 Multi-Instance GPU capability has been released, and we published a blog post on the topic:

Version 1.7.0 of the GPU Operator has just landed in the OpenShift OperatorHub, with many different updates. We are proud to announce that this version comes with support for the NVIDIA Multi-Instance GPU (MIG) feature of the A100 and A30 Ampere cards. MIG is the capability of an NVIDIA GPU card to be partitioned into multiple instances that are exposed to pods as independent GPUs.

This MIG support on the GPU Operator comes from a joint effort between the NVIDIA Cloud Native team and Red Hat Performance and Latency Sensitive Applications (PSAP) team.

NVIDIA A100 MIG support

Benchmarking HPC workloads on OpenShift

in Presentation, PSAP | talk

Today, our talk (with my teammate David Gray), entitled Benchmarking HPC Workloads on OpenShift, was presented at a 2021 conference. It’s a reworked version of what we presented at the SuperComputing 2020 - OpenShift Gathering, with more focus on my benchmarking work.

In this session, we’ll demonstrate how we used OpenShift as a proof-of-concept high-performance computing (HPC) platform for running scientific workloads.

We’ll present the set of tools and operators that were used to set up the HPC environment; then we’ll introduce two scientific applications, Gromacs and Specfem, that we benchmarked on this cluster.

We’ll detail how we ran Specfem on OpenShift with the help of a K8s Go client coordinating the application’s build and execution, and we’ll introduce the tool we designed to run the extensive benchmarking.

Finally, we’ll present the performance results on a 32-node cluster, comparing OpenShift with an identical bare-metal cluster.

HPC on OpenShift: Deploying Scientific Workloads on OpenShift

in Presentation, PSAP | talk

Today, our talk (with my teammate David Gray), entitled HPC on OpenShift: Deploying Scientific Workloads on OpenShift with the MPI Operator, was presented at the OpenShift Commons workshop of the KubeCon NA conference.

High Performance Computing (HPC) workloads increasingly rely on the use of containers that make applications easier to manage, preserve their dependencies and add portability across different environments. Red Hat OpenShift Container Platform is a Kubernetes-based platform for deploying containerized applications on shared compute resources.

In this talk, we will show how to effectively deploy scientific applications, GROMACS and SPECFEM3D Globe, on OpenShift with the MPI Operator from the Kubeflow project, using two different distributed shared filesystems, Lustre and CephFS.

We also published in-depth blog posts on this topic:

OpenShift Commons

Recording SPICE Adaptive Streaming

in Presentation | talk

Focused version

Today, I recorded a video clip presenting my work on SPICE Adaptive Streaming (based on my last talk in Grenoble).

More demos about the SPICE project are available on the website.

As part of the SPICE Adaptive Streaming project, we developed a toolbox for studying the performance of real-time video streaming. The toolbox consists of:

(1) a recording infrastructure that collects performance indicators in the guest/host/client systems,

(2) a scripted benchmarking engine that controls the systems’ configuration and the video encoding settings, and benchmarks each set of parameters one by one, and

(3) a graph-visualization GUI that plots the benchmark results and allows studying the impact of each parameter’s variations.

In the current step, we are working on a mathematical model of the resource usage (CPU, GPU, etc.).