Using NVIDIA A100’s Multi-Instance GPU to Run Multiple Workloads in Parallel on a Single GPU

in PSAP | blog-post

Today, my work on benchmarking NVIDIA A100 Multi-Instance GPUs running multiple AI/ML workloads in parallel has been published on the OpenShift blog:

The new Multi-Instance GPU (MIG) feature lets GPUs based on the NVIDIA Ampere architecture run multiple GPU-accelerated CUDA applications in parallel in a fully isolated way. The compute units of the GPU, as well as its memory, can be partitioned into multiple MIG instances. Each of these instances presents itself as a stand-alone GPU device from the system perspective and can be bound to any application, container, or virtual machine running on the node. At the hardware level, each MIG instance has its own dedicated resources (compute, cache, memory), so the workload running in one instance does not affect what is running in the others.
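To make the "stand-alone GPU device" point concrete, here is a minimal sketch (not part of the original post) that enumerates the MIG instances of the first physical GPU through NVML, assuming the nvidia-ml-py (pynvml) bindings and a MIG-enabled card:

```python
import pynvml

pynvml.nvmlInit()
gpu = pynvml.nvmlDeviceGetHandleByIndex(0)  # first physical GPU

# Each populated MIG instance is enumerated as its own device handle,
# with its own dedicated memory.
for i in range(pynvml.nvmlDeviceGetMaxMigDeviceCount(gpu)):
    try:
        mig = pynvml.nvmlDeviceGetMigDeviceHandleByIndex(gpu, i)
    except pynvml.NVMLError:
        continue  # no MIG instance created at this index
    mem = pynvml.nvmlDeviceGetMemoryInfo(mig)
    print(f"MIG device {i}: {mem.total / 2**30:.1f} GiB of dedicated memory")

pynvml.nvmlShutdown()
```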

In collaboration with NVIDIA, we extended the GPU Operator to give OpenShift users the ability to dynamically reconfigure the geometry of the MIG partitioning. The geometry of the MIG partitioning is the way hardware resources are bound to MIG instances, so it directly influences their performance and the number of instances that can be allocated. The A100-40GB, which we used for this benchmark, has eight compute units and 40 GB of RAM. When MIG mode is enabled, one compute unit is reserved for resource management, so at most seven MIG instances can be created.
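In practice, the GPU Operator's MIG manager watches the nvidia.com/mig.config node label and applies the requested geometry. Here is a hedged sketch of triggering such a reconfiguration with the Kubernetes Python client; the node name and the all-1g.5gb profile are only placeholders for illustration:

```python
# Minimal sketch: ask the GPU Operator's MIG manager to repartition a node's
# A100 into 1g.5gb instances by setting the nvidia.com/mig.config label.
# The node name "ocp-worker-0" is a placeholder.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
v1 = client.CoreV1Api()

patch = {"metadata": {"labels": {"nvidia.com/mig.config": "all-1g.5gb"}}}
v1.patch_node("ocp-worker-0", patch)
```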

NVIDIA A100 MIG benchmark

Multi-Instance GPU Support with the GPU Operator v1.7.0

in Work | blog-post

Today, my work on enabling the NVIDIA GPU Operator to support the A100 Multi-Instance GPU capability has been released, and we published a blog post on the topic:

Version 1.7.0 of the GPU Operator has just landed in the OpenShift OperatorHub, with many different updates. We are proud to announce that this version comes with support for the NVIDIA Multi-Instance GPU (MIG) feature of the A100 and A30 Ampere cards. MIG is the ability of these cards to be partitioned into multiple instances, each exposed to pods as an independent GPU.
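For illustration, here is a minimal sketch of a pod requesting one MIG instance with the Kubernetes Python client. It assumes the operator is configured with the "mixed" MIG strategy, which advertises instances as nvidia.com/mig-&lt;profile&gt; resources (with the "single" strategy they appear as plain nvidia.com/gpu); the container image is just an example:

```python
# Minimal sketch: request a single 1g.5gb MIG instance from a pod.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="mig-cuda-test"),
    spec=client.V1PodSpec(
        restart_policy="Never",
        containers=[client.V1Container(
            name="cuda-vectoradd",
            image="nvidia/samples:vectoradd-cuda11.2.1",  # example CUDA image
            resources=client.V1ResourceRequirements(
                limits={"nvidia.com/mig-1g.5gb": "1"},  # one MIG slice
            ),
        )],
    ),
)
v1.create_namespaced_pod(namespace="default", body=pod)
```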

This MIG support in the GPU Operator comes from a joint effort between the NVIDIA Cloud Native team and the Red Hat Performance and Latency Sensitive Applications (PSAP) team.

NVIDIA A100 MIG support

Benchmarking HPC workloads on OpenShift

in Presentation, PSAP | talk

Today, our talk (with my team-mate David Gray) entitled Benchmarking HPC workloads on OpenShift was presented at DevConf.cz 2021. It is a reworked version of what we presented at the SuperComputing 2020 OpenShift Gathering, with more focus on my benchmarking work.

In this session, we’ll demonstrate how we used OpenShift as a proof-of-concept high-performance computing (HPC) platform for running scientific workloads.

We’ll present the set of tools and operators that were used to set up the HPC environment, then we’ll introduce two scientific applications, Gromacs and Specfem, that we benchmarked on this cluster.

We’ll detail how we ran Specfem on OpenShift with the help of a K8s Go client coordinating the application build and execution, and we’ll introduce the tool we designed to run the extensive benchmarking.

Finally, we’ll present the performance results on a 32-node cluster comparing OpenShift with an identical bare-metal cluster.

DevConf.cz 2021

HPC on OpenShift: Deploying Scientific Workloads on OpenShift

in Presentation, PSAP | talk

Today, our talk (with my team-mate David Gray) entitled HPC on OpenShift: Deploying Scientific Workloads on OpenShift with the MPI Operator was presented at the OpenShift Commons workshop of the KubeCon NA conference.

High Performance Computing (HPC) workloads increasingly rely on the use of containers that make applications easier to manage, preserve their dependencies and add portability across different environments. Red Hat OpenShift Container Platform is a Kubernetes-based platform for deploying containerized applications on shared compute resources.

In this talk we will show how to effectively deploy scientific applications, GROMACS and SPECFEM3D Globe, on OpenShift using the MPI Operator from the Kubeflow project, with two different distributed shared filesystems, Lustre and CephFS.
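As a rough idea of what such a deployment looks like, here is a hedged sketch that submits a Kubeflow MPIJob with the Kubernetes Python client. The kubeflow.org/v1 schema details, the container image name, and the mpirun command line are illustrative assumptions, not the exact setup used in the talk:

```python
# Minimal sketch: submit an MPIJob for a GROMACS run through the
# CustomObjectsApi. Image name and mpirun command are placeholders.
from kubernetes import client, config

config.load_kube_config()
api = client.CustomObjectsApi()

mpijob = {
    "apiVersion": "kubeflow.org/v1",
    "kind": "MPIJob",
    "metadata": {"name": "gromacs-benchmark"},
    "spec": {
        "slotsPerWorker": 1,
        "mpiReplicaSpecs": {
            "Launcher": {
                "replicas": 1,
                "template": {"spec": {"containers": [{
                    "name": "launcher",
                    "image": "example.registry/gromacs:latest",
                    "command": ["mpirun", "gmx_mpi", "mdrun", "-s", "benchmark.tpr"],
                }]}},
            },
            "Worker": {
                "replicas": 4,  # number of MPI worker pods
                "template": {"spec": {"containers": [{
                    "name": "worker",
                    "image": "example.registry/gromacs:latest",
                }]}},
            },
        },
    },
}

api.create_namespaced_custom_object(
    group="kubeflow.org", version="v1", namespace="default",
    plural="mpijobs", body=mpijob,
)
```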

We have also published in-depth blog posts on this topic.

OpenShift Commons

Demonstrating Performance Capabilities of Red Hat OpenShift for Running Scientific HPC Workloads

in Work | blog-post

Today, we’ve published the work I’ve done in cooperation with David Gray on the performance evaluation of running scientific HPC applications on OpenShift.

This blog post is a follow-up to the previous blog post on running GROMACS on Red Hat OpenShift Container Platform (OCP) using the Lustre filesystem. In this post, we will show how we ran two scientific HPC workloads on a 38-node OpenShift cluster using CephFS with OpenShift Container Storage in external mode. We will share the benchmarking results of MPI Microbenchmarks, GROMACS, and SPECFEM3D Globe. We ran these workloads on OpenShift and compared them against the results from a bare-metal MPI cluster using the same hardware.

Specfem on OpenShift

Recording SPICE Adaptive Streaming

in Presentation | talk


Today, I recorded a video clip presenting my work on SPICE Adaptive Streaming (based on my last talk in Grenoble).

More demos about the SPICE project are available on the spice-space.org website.

As part of the SPICE Adaptive Streaming project, we developed a toolbox for studying the performance of real-time video streaming. The toolbox consists of:

(1) a recording infrastructure that collects performance indicators in the guest/host/client systems,

(2) a scripted benchmarking engine that controls the systems’ configuration and the video encoding settings, and benchmarks each set of parameters one by one, and

(3) a graph visualization GUI that plots the benchmark results and allows studying the impact of each parameter’s variations.

We are currently working on a mathematical model of the resource usage (CPU, GPU, etc.).
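To give a flavor of what the scripted benchmarking engine (item 2 above) does, here is a purely illustrative Python sketch of such a parameter sweep; the helper function and parameter names are hypothetical, not the toolbox's actual interface:

```python
# Illustrative sketch: sweep over encoding parameters one combination at a
# time and record the collected performance indicators.
import itertools
import json

ENCODING_PARAMS = {
    "codec": ["h264", "vp8"],
    "bitrate_kbps": [2000, 4000, 8000],
    "framerate": [30, 60],
}

def run_streaming_benchmark(settings):
    """Placeholder: configure guest/host/client, stream, collect indicators."""
    return {"settings": settings, "cpu_usage": None, "frame_drops": None}

results = []
keys = list(ENCODING_PARAMS)
for combo in itertools.product(*(ENCODING_PARAMS[k] for k in keys)):
    settings = dict(zip(keys, combo))
    results.append(run_streaming_benchmark(settings))

with open("benchmark_results.json", "w") as f:
    json.dump(results, f, indent=2)
```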

Working on SPICE at Red Hat

Since June 3rd, 2019, I have been part of the Red Hat SPICE team, working remotely from the Grenoble area, France.

With teammates spread across Italy, the UK, Poland, Israel, Brazil and the US, I will work on SPICE, Red Hat’s solution for remote virtual desktops: you run virtual machines on a powerful server, and you access them transparently over the LAN or the Internet.

SPICE offers features such as:

  • USB redirection (plug your mouse/keyboard/USB stick into your computer, and it shows up as plugged into the VM),
  • file drag-and-drop, to seamlessly transfer files from your computer to the VM, as well as shared directories,
  • shared clipboard for transparent copy-and-paste

Happy to join Red Hat

Public release of my QEMU snapshot work

in Code | code

On 2022-01-17, I found out that Virtual Open Systems had released, two years earlier (July 2019), the work I did over roughly two years (May 2017 to May 2019) as part of the Exanode European research project:

Virtual Open Systems developed a QEMU extension for virtual machine periodic checkpointing. A repository including all the changes is available at this address. The code is released under the GNU GPLv2. Virtual Open Systems is working on a companion page on its website that explains how to compile the code and reproduce the periodic checkpointing of an ARMv8 virtual machine. The page will be reachable from this address.

Unfortunately, they squashed all the commits together, making the work hard to review and/or rebase… and it has never been submitted upstream.

  • Commit on the Virtual Open Systems GitLab instance
  • Copy of the commit in my GitHub account, with minor improvements (split of the original UFFD commits, markdown formatting of the squashed commit message).

exanode qemu Virtual Open Systems