All Systems Go!

Published by James Frost.

My takeaways from the recent user-space Linux conference.

All Systems Go! was a conference in Berlin, from 2023-09-13 to 2023-09-14. I attended, mostly just because it sounded interesting. I was not disappointed.

This page contains the notes I made at the conference. I've left it fairly rough, but perhaps it is of interest.

Opening

Social event this evening, info in email. Starting at 19:00.

The sessions were livestreamed, and are now available on YouTube.

Unified Kernel Images (UKIs)

Talk given by Lennart Poettering.

A UKI is a kernel, initrd and kernel command line all packed into a single UEFI binary, with systemd-stub at the front.

systemd-stub runs first in UEFI, then transitions to the kernel. It does things like showing boot screens, measuring system configuration (for the TPM, etc.), loading system extensions (such as large GPU drivers) and seeding the RNG, before booting the kernel.

PCRs (Platform Configuration Registers) in the TPM store a value (a hash) that gets combined with each hash passed to the TPM by loaders measuring their next step. The resulting PCR values can be compared against expected values, which are signed with a private key so that they can be checked with the matching public key.
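
To make the "combined" step concrete, here is a minimal Python sketch of how a TPM 2.0 SHA-256 PCR bank is extended: the new value is the hash of the old value concatenated with the digest of whatever was measured, so the final value depends on every measurement and on their order. The boot components are made-up byte strings; in reality the TPM performs the extend internally, and signing the expected values is not shown.

```python
import hashlib

PCR_SIZE = 32  # a SHA-256 PCR bank holds 32-byte values, all zero after reset


def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    """New PCR value = SHA-256(old PCR value || SHA-256 digest of the measured data)."""
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()


def replay(measurements: list) -> bytes:
    """Recompute a PCR from reset by extending each measurement in order."""
    pcr = bytes(PCR_SIZE)
    for m in measurements:
        pcr = pcr_extend(pcr, m)
    return pcr


# Hypothetical boot components, measured in boot order.
boot_chain = [b"kernel image", b"initrd", b"console=ttyS0 quiet"]
expected = replay(boot_chain)

# The same components in the same order reproduce the value...
assert replay(boot_chain) == expected
# ...but a modified component, or a different order, does not.
assert replay([b"kernel image", b"initrd", b"init=/bin/sh"]) != expected
assert replay(list(reversed(boot_chain))) != expected
print(expected.hex())
```

This is why a verifier only needs the final PCR value (plus a signature over the expected one): any change anywhere in the chain produces a different result.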

Where does the public key used for verification come from? It is enrolled at install time. It doesn't have rollback protection itself, but there are other mitigations for that. Another potential issue is that there doesn't seem to be any provision for key rotation, though maybe that just means changing it while the system is booted? I guess the issue is when the vendor wants to update their key. It seems this is technically possible, but not yet implemented. Also, the kernel supports CA chains, so you can just sign a new key with an old one to do migrations.

XBOOTLDR is an extra partition to work around the too-small EFI System Partitions that commonly come with computers. If you are doing a clean install, it is better to just make the EFI partition big enough.

systemd-measure: generates PCR 11 signatures for embedding into a UKI.

ukify: Python script to generate UKIs.

kernel-install is a tool which takes packaged kernels, turns them into UKIs, and installs them. This means that distros like Debian can install kernels as UKIs, and sign them with a personal signing key.

Add-ons

PE binaries with extra content (such as additional kernel command line options) that gets loaded by the main UKI. This allows the local administrator to customise the UKI without needing the vendor's signing key.

btrfs Encryption

Talk given by Sweet Tea from Meta.

Facebook have lots of machines, running lots of services. Each running service has its own btrfs subvolume. They are generated by btrfs send/receive.

Service only ever writes to its own subvolume. Multiple services on a single machine, each with their own subvolume.

The goal of encryption is to aid secure deletion, so that a later service on the same machine cannot exfiltrate data written by a previous one.

LUKS / dm-crypt is commonly used. It encrypts everything, but only acts on whole devices, so it is not particularly suitable for multiple users.

fscrypt is used on Android. It encrypts file names and file data, but not the metadata, and allows one key per directory tree. However, it prevents reflinks and deduplication from working. It also doesn't support nested encryption, and because it prevents mixing encrypted and unencrypted content, packages are unnecessarily encrypted.

btrfs has been modified to support fscrypt without these limitations. The work is headed upstream, hopefully for kernel 6.7. It can be used with standard fscrypt commands.

Future work includes adding authenticated encryption.

Packages are received with btrfs receive unencrypted (or rather, encrypted with a common key), then a new key is added, and all future writes are encrypted with it.

System and Configuration Extensions for Image-based Linux Distros and Beyond

Sysexts are filesystem images (DDIs) that create an overlay for /usr or /opt. They should be purely additive.
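
A toy Python model of what "purely additive" means (this is not how systemd-sysext is implemented): each tree is a mapping from path to contents, the overlay lets upper layers win on conflicts, and an extension is purely additive when it shadows nothing in the base image. All paths and contents here are hypothetical.

```python
def overlay(base: dict, *extensions: dict) -> dict:
    """Merge trees bottom-up; on a path conflict the later (upper) layer wins."""
    merged = dict(base)
    for ext in extensions:
        merged.update(ext)
    return merged


def shadowed_paths(base: dict, extension: dict) -> set:
    """Paths in the base tree that the extension would hide."""
    return set(base) & set(extension)


def is_purely_additive(base: dict, extension: dict) -> bool:
    """True if the extension only adds new paths."""
    return not shadowed_paths(base, extension)


# Hypothetical /usr trees: path -> file contents.
base_usr = {"/usr/bin/ls": b"ls", "/usr/lib/libfoo.so": b"foo"}
debug_ext = {"/usr/bin/strace": b"strace"}
bad_ext = {"/usr/bin/ls": b"patched ls", "/usr/bin/gdb": b"gdb"}

assert is_purely_additive(base_usr, debug_ext)
assert shadowed_paths(base_usr, bad_ext) == {"/usr/bin/ls"}
merged = overlay(base_usr, debug_ext)
assert merged.keys() == base_usr.keys() | debug_ext.keys()
```

With a purely additive extension the merged view is just the union of the trees, so adding or removing the extension never changes what the base image's own files contain.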

Confexts are like sysexts but for /etc, and are meant to give configuration a clear separation between vendor and user.

If you have certificates in the Machine Owner Keyring (MOK), you can use them to verify signed dm-verity extensions. Purely additive extensions can be applied atomically at runtime.
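
As background on what dm-verity does: it protects a read-only image with a hash tree, so only the root hash (which is what gets signed) needs to be trusted. The Python below is a simplified sketch of that idea; real dm-verity packs many hashes into each hash block rather than building a binary tree, supports a salt, and verifies blocks lazily as they are read. The image is made-up data.

```python
import hashlib

BLOCK_SIZE = 4096  # dm-verity is commonly set up with 4 KiB data blocks


def root_hash(image: bytes) -> bytes:
    """Simplified binary hash tree over fixed-size blocks, returning the root hash."""
    blocks = [image[i:i + BLOCK_SIZE] for i in range(0, len(image), BLOCK_SIZE)] or [b""]
    level = [hashlib.sha256(block).digest() for block in blocks]
    while len(level) > 1:
        if len(level) % 2:  # pair up an odd node with a copy of itself
            level.append(level[-1])
        level = [hashlib.sha256(level[i] + level[i + 1]).digest()
                 for i in range(0, len(level), 2)]
    return level[0]


image = bytes(3 * BLOCK_SIZE)    # hypothetical 12 KiB extension image
trusted_root = root_hash(image)  # the value that would be signed

tampered = bytearray(image)
tampered[5000] ^= 0x01           # flip a single bit in the second block
assert root_hash(bytes(tampered)) != trusted_root
assert root_hash(image) == trusted_root
```

Because every block feeds into the root, a single flipped bit anywhere in the image is detected, without needing a signature per file.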

Extensions can either pin to a specific OS version, or statically link everything.

Early boot settings aren't always applied. Software should work without anything in /etc. Confexts are always mounted with noexec, so binaries and scripts should always be put in a sysext.

Originally designed for Azure Boost (details in a later talk).

It is also useful in other cases where people want to be able to easily add or remove their own software or other tools.

What is the use case over a container image?

Still work to do to enhance systemd tooling around it.

Retake of service restarts

Soft Reboot

How is the role that portable services fulfil different from a container?

They are more suitable for system services, rather than compute workloads.

TPM Talk

By Lennart Poettering

Lots of systemd tools to do various parts of distro setup. Would be good to add authentication with dm-integrity to dm-crypt encryption.

Unified TPM event log

Why does Linux take so long to boot?

Not a priority. Much of the time goes on memory training, but that can be cached.

A/B partitioning - let's talk about the dirty RW files

Collabora talk on the Steam Deck.

Started by outlining the requirements.

OpenWrt uses read-only squashfs images.

Fixed-size EFI, root, and small /var partitions. Large subtrees of /var (/var/tmp, etc.) live in the same partition as /home.

/etc is an overlay whose upper layer lives on the /var partition, so system-specific config is in the base image, with local changes in the upper layer. This causes issues with files being copied up into the upper layer, even when only their timestamp changes.
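
A toy Python model of why copy-up is a problem. It only mimics the overlay lookup rules, and the file, contents and timestamps are hypothetical: once any change, even just to the timestamp, copies a file into the upper layer on /var, that copy hides every later version of the file shipped in a new base image.

```python
from dataclasses import dataclass, field


@dataclass
class FileEntry:
    content: bytes
    mtime: float


@dataclass
class EtcOverlay:
    lower: dict                                # /etc as shipped in the base image
    upper: dict = field(default_factory=dict)  # writable layer stored on /var

    def lookup(self, path: str) -> FileEntry:
        # A file in the upper layer hides the lower-layer version entirely.
        if path in self.upper:
            return self.upper[path]
        return self.lower[path]

    def touch(self, path: str, mtime: float) -> None:
        # Even a metadata-only change copies the whole file up into the upper layer.
        current = self.lookup(path)
        self.upper[path] = FileEntry(current.content, mtime)


etc = EtcOverlay(lower={"/etc/hosts": FileEntry(b"127.0.0.1 localhost\n", 0.0)})
etc.touch("/etc/hosts", 1.0)  # e.g. some tool only updated the timestamp

# An A/B update later ships a new base image with different contents...
etc.lower = {"/etc/hosts": FileEntry(b"127.0.0.1 localhost\n::1 localhost\n", 2.0)}
# ...but the stale copied-up file still wins.
assert etc.lookup("/etc/hosts").content == b"127.0.0.1 localhost\n"
```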

SUSE gets around space issues with btrfs subvolumes.

The way to handle small changes in /etc is drop-in files (e.g. /etc/ssh/sshd_config.d/*.conf). There is an effort to get more things supporting that, and things like overlaying config in /var. A spec is coming from the UAPI group.
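
A sketch of how drop-in merging commonly works, in the style of systemd's *.conf.d directories: the vendor file is read first, then each drop-in in lexical filename order, with later values overriding earlier ones. The file names, keys and simple key = value syntax below are hypothetical.

```python
import tempfile
from pathlib import Path


def parse(text: str) -> dict:
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def load_with_dropins(main_file: Path, dropin_dir: Path) -> dict:
    """Vendor defaults first, then each *.conf drop-in in lexical order; later values win."""
    settings = parse(main_file.read_text())
    for dropin in sorted(dropin_dir.glob("*.conf")):
        settings.update(parse(dropin.read_text()))
    return settings


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    main = root / "example.conf"       # hypothetical vendor file
    dropins = root / "example.conf.d"  # hypothetical drop-in directory
    dropins.mkdir()
    main.write_text("Port = 22\nLogLevel = INFO\n")
    (dropins / "10-admin.conf").write_text("LogLevel = VERBOSE\n")
    (dropins / "20-local.conf").write_text("Port = 2222\n")
    result = load_with_dropins(main, dropins)

print(result)  # {'Port': '2222', 'LogLevel': 'VERBOSE'}
```

The vendor file never needs editing, so an image update can replace it freely. Not every program merges this way, though: OpenSSH's sshd uses the first value it obtains for most keywords, so there an earlier-sorted drop-in takes precedence.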

Confidential, Trusted, Cloud Native Workloads

Talk given by Microsoft Azure team.

Want to provide the "Let's Encrypt" of confidential compute, ensuring that everyone can access it.

Confidential Virtual Machines allow a lift-and-shift style move to confidential compute, and take away the host's privileges over the workload.

Kubernetes works with containers in pods, which can access each other.

Could run the entirety of Kubernetes inside a TEE (Trusted Execution Environment), but that is rather a lot of code. So the talk proposes using pods as the trust boundary.

This effort goes under the name Confidential Containers (CoCo), with support from lots of vendors.

Kata containers runs containers (or pods) in lightweight VMs.

Issues around nested virtualisation, so ideally want to use bare metal servers. This has a startup performance cost.

Remote attestation is a key part of the project. RFC9334 (RATS) describes how remote attestation works.
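
RFC 9334 describes roles such as the Attester, which produces signed evidence about itself, and the Verifier, which appraises that evidence against reference values. A minimal Python sketch of one challenge-response round, with a nonce for freshness; the workload bytes and key are hypothetical, and HMAC is swapped in for the asymmetric signature a real TEE would produce, only because the standard library has no asymmetric signing.

```python
import hashlib
import hmac
import secrets

# Stand-in for a hardware-protected attestation key. HMAC with a shared secret is
# used ONLY because Python's stdlib has no asymmetric signatures; real evidence is
# signed by the TEE/TPM and checked by the verifier against a certificate chain.
ATTESTATION_KEY = secrets.token_bytes(32)
DIGEST_SIZE = 32


def attester_produce_evidence(workload: bytes, nonce: bytes) -> tuple:
    """Attester: measure the workload and bind the verifier's nonce into signed evidence."""
    evidence = hashlib.sha256(workload).digest() + nonce
    signature = hmac.new(ATTESTATION_KEY, evidence, hashlib.sha256).digest()
    return evidence, signature


def verifier_appraise(evidence: bytes, signature: bytes, nonce: bytes,
                      reference_value: bytes) -> bool:
    """Verifier: check the signature, freshness (nonce) and the measurement."""
    expected = hmac.new(ATTESTATION_KEY, evidence, hashlib.sha256).digest()
    if not hmac.compare_digest(signature, expected):
        return False
    measurement, echoed_nonce = evidence[:DIGEST_SIZE], evidence[DIGEST_SIZE:]
    return (hmac.compare_digest(echoed_nonce, nonce)
            and hmac.compare_digest(measurement, reference_value))


reference = hashlib.sha256(b"pod image v1").digest()  # hypothetical reference value
nonce = secrets.token_bytes(16)
evidence, sig = attester_produce_evidence(b"pod image v1", nonce)
assert verifier_appraise(evidence, sig, nonce, reference)

# Replaying old evidence against a new challenge fails the freshness check.
assert not verifier_appraise(evidence, sig, secrets.token_bytes(16), reference)
# A modified workload does not match the reference value.
bad_evidence, bad_sig = attester_produce_evidence(b"pod image v1 + backdoor", nonce)
assert not verifier_appraise(bad_evidence, bad_sig, nonce, reference)
```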

Meta's Adventures in Userspace Linux

The userspace team grew out of the kernel team at Facebook, with an upstream-first approach.

Meta has millions of hosts, mostly homogeneous, but with increasing hardware diversity.

Teams run their own bare metal servers, but there are a large number of centralised container hosts.

In the middle of upgrading from CentOS Stream 8 to CentOS Stream 9.

systemd is updated to each new release, following upstream.

systemd-oomd was originally developed at Meta; it's a userspace OOM killer that has more flexible policies than the kernel OOM killer.
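
As background, systemd-oomd acts on the kernel's Pressure Stall Information (PSI), which reports the share of time tasks were stalled waiting on memory, in files like /proc/pressure/memory and each cgroup's memory.pressure. A minimal Python sketch of a pressure-based policy; the cgroup names, the readings and the "pick the cgroup with the highest full avg10 above a limit" rule are hypothetical simplifications, not oomd's actual selection logic.

```python
from typing import Dict, Optional


def parse_psi(text: str) -> Dict[str, Dict[str, float]]:
    """Parse PSI text such as /proc/pressure/memory or a cgroup's memory.pressure."""
    result = {}
    for line in text.strip().splitlines():
        kind, *fields = line.split()  # kind is "some" or "full"
        result[kind] = {key: float(value) for key, value in (f.split("=") for f in fields)}
    return result


def pick_victim(pressure_by_cgroup: Dict[str, str], limit_pct: float) -> Optional[str]:
    """Return the cgroup whose 10s 'full' memory pressure exceeds the limit the most."""
    over_limit = {}
    for cgroup, text in pressure_by_cgroup.items():
        full_avg10 = parse_psi(text)["full"]["avg10"]
        if full_avg10 > limit_pct:
            over_limit[cgroup] = full_avg10
    return max(over_limit, key=over_limit.get) if over_limit else None


# Hypothetical memory.pressure readings for two cgroups.
samples = {
    "batch.slice": "some avg10=80.00 avg60=70.00 avg300=40.00 total=900000\n"
                   "full avg10=65.00 avg60=55.00 avg300=30.00 total=700000",
    "web.slice": "some avg10=5.00 avg60=3.00 avg300=1.00 total=12000\n"
                 "full avg10=1.00 avg60=0.50 avg300=0.20 total=4000",
}
victim = pick_victim(samples, limit_pct=60.0)
print(victim)  # batch.slice
```

Acting on sustained pressure, rather than waiting for the kernel to run out of memory entirely, is what lets a userspace killer intervene earlier and with per-workload policies.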

Until recently, Meta was still using network scripts. With CentOS Stream 9 they have now migrated to systemd-networkd.

Hyperscale SIG works on making CentOS Stream suitable for large scale deployments.

Asahi Linux is good because it brings together so many people from around the community.

Frame pointers were enabled by default in Fedora 38. They slightly slow things down, as they use up an extra register, but make profiling and debugging much easier. This originally came from someone wanting better traces in systemd.
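
Why frame pointers make profiling cheaper: with them, every function prologue saves the caller's frame pointer (on x86-64, push rbp; mov rbp, rsp), so the stack forms a linked list a profiler can walk without consulting DWARF unwind tables. A minimal Python simulation of that walk; the memory layout and addresses are made up.

```python
def unwind(memory: dict, rbp: int, rip: int, max_frames: int = 64) -> list:
    """Walk the frame-pointer chain, returning the current and return addresses."""
    trace = [rip]
    while rbp != 0 and len(trace) < max_frames:
        trace.append(memory[rbp + 8])  # return address, pushed by `call`
        rbp = memory[rbp]              # caller's rbp, saved by `push rbp` in the prologue
    return trace


# Hypothetical stack for main() -> b() -> c(), with c() currently running.
stack = {
    0x6F00: 0x6F80, 0x6F08: 0x401250,        # c's frame: saved rbp of b, return into b
    0x6F80: 0x7000, 0x6F88: 0x401150,        # b's frame: saved rbp of main, return into main
    0x7000: 0x0,    0x7008: 0x7FFFF7A2D083,  # main's frame: end of chain, return into libc
}
trace = unwind(stack, rbp=0x6F00, rip=0x401300)
print([hex(a) for a in trace])
# ['0x401300', '0x401250', '0x401150', '0x7ffff7a2d083']
```

Each step is just two memory reads, which is why a sampling profiler can afford to do it on every sample; without frame pointers it has to locate each frame's size from unwind tables instead.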

Lots of current work around BPF:

openSUSE Aeon

How do daemons work? E.g. having an SSH server to be able to remote in? Or is that against its opinionated design? How about WireGuard?

Distrobox is pretty cool.

Distrobox export is good for allowing access to packages from other distros.

Is there a way to add additional directories to distroboxes, like NFS mounts?

How does systemwide config work? Such as NFS mounts. /etc is still user editable. Layered mounting for config.

Flatpaks are the recommended installation mechanism.

Are you able to access the host apps from within distrobox? E.g. view an image generated by a CLI program.

Do encrypted subvolumes (e.g. btrfs) make homed any easier?

Does image based deployment make sense for a desktop use case?

systemd-boot integration in openSUSE

Simplifying the boot process is good. Also allows removal of crypto code from GRUB, e.g. for FDE.

Evening

Had a good evening hanging out with some other conference attendees. Two were from Meta (formerly Facebook), and two were from an international development non-profit.

It was a fun evening, and my meal was at Meta's expense.