Virtual Open Systems March newsletter was published today, with an article about my work:

Checkpointing for HPC: High performance live checkpointing

At the 2018 HiPEAC ExascaleHPC workshop organized in the context of the ExaNoDe EC project, Virtual Open Systems has presented the progress of its implementation of live and incremental checkpointing, for Qemu-KVM.

The live aspect of this work reduce the virtual machine (VM) downtime to a few milliseconds, while the RAM is copied to disk in background; and the incremental aspect allows to checkpoint only the pages actually modified since the previous checkpoint. Periodic virtual machine checkpointing improves the reliability of HPC and cloud-computing environments, as it prevents the loss of volatile data in case of hardware failure. The live aspect makes it virtually transparent for the user, whose VM keeps running unaltered during the checkpointing. The incremental aspect further reduces the checkpoint impact on the system, as only part of the RAM is saved, and also reduces the footprint of the checkpoints on the disk.

The challenges behind both aspects of the checkpointing are related to the tracking and handling of the memory pages being modified by the guest system. Virtual Open Systems developed a novel approach to track these changes in Qemu, which guarantees the consistency of every checkpoint, regardless the activity of the guest system.

Qemu checkpointing