1 Xen Live Patching Design v2

1.1 Rationale

A mechanism is required to binary patch the running hypervisor with new opcodes, primarily to deliver security updates.

This document describes the design of the API that would allow us to upload binary patches to the hypervisor.

The document is split into four sections:

Detailed descriptions of the problem statement.
Design of the data structures.
Design of the hypercalls.
Implementation notes that should be taken into consideration.

1.2 Glossary

splice - patch in the binary code with new opcodes
trampoline - a jump to a new instruction.
payload - telemetries of the old code along with binary blob of the new function (if needed).
reloc - telemetries contained in the payload to construct proper trampoline.
hook - an auxiliary function being called before, during or after payload application or revert.
quiescent zone - period when all CPUs are lock-step with each other.

1.3 History

The document has undergone various reviews and only covers v1 design.

The end of the document has a section titled Not Yet Done which outlines ideas and design for the future version of this work.

1.4 Multiple ways to patch

The mechanism needs to be flexible enough to patch the hypervisor in multiple ways while being as simple as possible. The compiled code is contiguous in memory with no gaps - so we have no luxury of ‘moving’ existing code and must either insert a trampoline to the new code to be executed - or modify the code in place, which is only possible if there is sufficient space. The placement of new code has to be done by the hypervisor and the virtual address for the new code is allocated dynamically.

This implies that the hypervisor must compute the new offsets when splicing in the new trampoline code. Where the trampoline is added (inside the function we are patching or just the callers?) is also important.

To lessen the amount of code in the hypervisor, the consumer of the API is responsible for identifying which mechanism to employ and how many locations to patch. Combinations of modifying in-place code, adding trampolines, etc. have to be supported. The API should allow reading and writing any memory within the hypervisor virtual address space.

We must also have a mechanism to query what has been applied and a mechanism to revert it if needed.

1.5 Workflow

The expected workflows of higher-level tools that manage multiple patches on production machines would be:

The first obvious task is loading all available / suggested hotpatches when they are available.
Whenever new hotpatches are installed, they should be loaded too.
One wants to query which modules have been loaded at runtime.
If unloading is deemed safe (see unloading below), one may want to support a workflow where a specific hotpatch is marked as bad and unloaded.

1.6 Patching code

The first mechanism to patch that comes to mind is in-place replacement. That is, replace the affected code with new code. Unfortunately x86 instructions are variable-length, which places limits on how much space we have available to replace the instructions. That is not a problem if the change is smaller than the original opcode and we can fill it with nops. Problems will appear if the replacement code is longer.

The second mechanism is to replace the call or jump to the old function with a call or jump to the address of the new function.
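
The in-place case can be sketched as follows. This is a minimal illustration, not the hypervisor's code: the helper name splice_in_place and the X86_NOP macro are made up for this example, and the caller is assumed to pass a span that starts and ends on instruction boundaries.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define X86_NOP 0x90 /* single-byte x86 NOP */

/*
 * Hypothetical helper: overwrite old_len bytes at site with the
 * replacement, padding with NOPs when the replacement is shorter.
 * Returns 0 on success, -1 if the replacement does not fit.
 */
static int splice_in_place(uint8_t *site, size_t old_len,
                           const uint8_t *repl, size_t repl_len)
{
    if (repl_len > old_len)
        return -1; /* longer code cannot be placed here */

    memcpy(site, repl, repl_len);
    memset(site + repl_len, X86_NOP, old_len - repl_len);
    return 0;
}
```

A replacement that is longer than the original span is rejected, which is the situation where one of the other mechanisms has to be used.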

A third mechanism is to add a jump to the new function at the start of the old function. N.B. The Xen hypervisor implements the third mechanism. See Trampoline (e9 opcode) section for more details.

1.6.1 Example of trampoline and in-place splicing

As an example we will assume the hypervisor does not have XSA-132 (see domctl/sysctl: don’t leak hypervisor stack to toolstacks) and we would like to binary patch the hypervisor with it. The original code looks like this:

48 89 e0                  mov    %rsp,%rax
48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax

while the new patched hypervisor would be:

48 c7 45 b8 00 00 00 00   movq   $0x0,-0x48(%rbp)
48 c7 45 c0 00 00 00 00   movq   $0x0,-0x40(%rbp)
48 c7 45 c8 00 00 00 00   movq   $0x0,-0x38(%rbp)
48 89 e0                  mov    %rsp,%rax
48 25 00 80 ff ff         and    $0xffffffffffff8000,%rax

This is inside arch_do_domctl. The change adds 24 extra bytes of code (three 8-byte movq instructions), which alters all the offsets inside the function. We might not have enough space in .text to alter these offsets and squeeze in the extra 24 bytes.

As such we could simplify this problem by only patching the site which calls arch_do_domctl:

do_domctl:
e8 4b b1 05 00          callq  ffff82d08015fbb9 <arch_do_domctl>

with a new address for where the new arch_do_domctl would be (this area would be allocated dynamically).
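
Retargeting such a call amounts to recomputing its 32-bit displacement, which is relative to the end of the 5-byte instruction. The sketch below is illustrative only: the helper names are hypothetical, a little-endian host is assumed, and the call-site address used in the usage note is derived from the displacement and target shown above rather than stated in this document.

```c
#include <stdint.h>
#include <string.h>

#define CALL_REL32_LEN 5 /* e8 opcode + 32-bit displacement */

/* Decode the target of an x86-64 "call rel32" located at site_addr. */
static uint64_t call_target(uint64_t site_addr,
                            const uint8_t insn[CALL_REL32_LEN])
{
    int32_t disp;

    memcpy(&disp, insn + 1, sizeof(disp)); /* little-endian host assumed */
    return site_addr + CALL_REL32_LEN + (int64_t)disp;
}

/*
 * Re-encode the call so that it reaches new_target. Returns -1 if the
 * new target is outside the +/-2GiB range of a rel32 displacement.
 */
static int retarget_call(uint64_t site_addr, uint64_t new_target,
                         uint8_t insn[CALL_REL32_LEN])
{
    int64_t disp = (int64_t)(new_target - (site_addr + CALL_REL32_LEN));

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    int32_t disp32 = (int32_t)disp;
    insn[0] = 0xe8;
    memcpy(insn + 1, &disp32, sizeof(disp32));
    return 0;
}
```

With the bytes e8 4b b1 05 00 placed at 0xffff82d080104a69 (derived), call_target() yields 0xffff82d08015fbb9, matching the disassembly above. The range check matters because the dynamically allocated area must lie within 2GiB of every call site that is patched this way.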

Astute readers will wonder what we need to do if we were to patch do_domctl - which is not called directly by the hypervisor but on behalf of the guests via the compat_hypercall_table and hypercall_table. Patching the offset in hypercall_table for do_domctl:

ffff82d08024d490:   79 30
ffff82d08024d492:   10 80 d0 82 ff ff

with the address of the new do_domctl is possible. The other place where it is used is in hvm_hypercall64_table which would need to be patched in a similar way. This would require an in-place splicing of the new virtual address of do_domctl.

In summary this example patched the callers of the affected function by

Allocating memory for the new code to live in,
Changing the virtual address in all the functions which called the old code (computing the new offset, patching the callq with a new callq).
Changing the function pointer tables with the new virtual address of the function (splicing in the new virtual address). Since this table resides in the .rodata section we would need to temporarily change the page table permissions during this part.
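
Splicing a table entry can be sketched as below. The helper names are hypothetical and the page-permission change needed for .rodata is deliberately not modelled; the example only shows that the eight bytes in the dump above form a little-endian 64-bit address, and that the old value can be checked before it is overwritten.

```c
#include <stdint.h>

/* Read a 64-bit function pointer slot stored in little-endian order. */
static uint64_t read_slot(const uint8_t slot[8])
{
    uint64_t v = 0;

    for (int i = 7; i >= 0; i--)
        v = (v << 8) | slot[i];
    return v;
}

/*
 * Hypothetical helper: replace the address in a table slot, but only
 * if the slot currently holds the address we expect to replace.
 * Returns 0 on success, -1 if the slot holds something else.
 */
static int splice_slot(uint8_t slot[8], uint64_t old_fn, uint64_t new_fn)
{
    if (read_slot(slot) != old_fn)
        return -1;

    for (int i = 0; i < 8; i++)
        slot[i] = (uint8_t)(new_fn >> (8 * i));
    return 0;
}
```

Applied to the bytes 79 30 10 80 d0 82 ff ff from the hypercall_table dump, read_slot() returns 0xffff82d080103079.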

However it has drawbacks - the safety checks, which have to make sure the function is not on the stack, must also check every caller. For some patches this could mean - if there were a sufficiently large number of callers - that we would never be able to apply the update.

Having the patching done at predetermined instances where the stacks are not deep mostly solves this problem.

1.6.2 Example of different trampoline patching.

An alternative mechanism exists where we can insert a trampoline in the existing function to be patched to jump directly to the new code. This lessens the locations to be patched to one but it puts pressure on the CPU branching logic (I-cache, but it is just one unconditional jump).

For this example we will assume that the hypervisor has not been compiled with XSA-125 (see pre-fill structures for certain HYPERVISOR_xen_version sub-ops) which mem-sets a structure in the xen_version hypercall. This function is not called anywhere in the hypervisor (it is called by the guest) but referenced in the compat_hypercall_table and hypercall_table (and indirectly called from that). Patching the offset in hypercall_table for the old do_xen_version:

ffff82d08024b270 <hypercall_table>:
...
ffff82d08024b2f8:   9e 2f 11 80 d0 82 ff ff

with the address of the new do_xen_version is possible. The other place where it is used is in hvm_hypercall64_table which would need to be patched in a similar way. This would require an in-place splicing of the new virtual address of do_xen_version.

An alternative solution would be to insert a trampoline in the old do_xen_version function to directly jump to the new do_xen_version:

ffff82d080112f9e do_xen_version:
ffff82d080112f9e:       48 c7 c0 da ff ff ff    mov    $0xffffffffffffffda,%rax
ffff82d080112fa5:       83 ff 09                cmp    $0x9,%edi
ffff82d080112fa8:       0f 87 24 05 00 00       ja     ffff82d0801134d2 ; do_xen_version+0x534

with:

ffff82d080112f9e do_xen_version:
ffff82d080112f9e:       e9 XX YY ZZ QQ          jmpq   [new do_xen_version]

which would lessen the amount of patching to just one location.
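
The XX YY ZZ QQ bytes are the little-endian 32-bit displacement from the end of the jump instruction to the new function. A minimal sketch of building the trampoline follows; make_trampoline is a hypothetical name, and the new function address in the usage note is invented for illustration.

```c
#include <stdint.h>

#define JMP_REL32_LEN 5 /* e9 opcode + 32-bit displacement */

/*
 * Build "jmp rel32" to be placed at old_addr so that it lands on
 * new_addr. Returns -1 if new_addr is not reachable with a signed
 * 32-bit displacement.
 */
static int make_trampoline(uint64_t old_addr, uint64_t new_addr,
                           uint8_t insn[JMP_REL32_LEN])
{
    /* The displacement is relative to the instruction after the jump. */
    int64_t disp = (int64_t)(new_addr - (old_addr + JMP_REL32_LEN));

    if (disp < INT32_MIN || disp > INT32_MAX)
        return -1;

    insn[0] = 0xe9;
    for (int i = 0; i < 4; i++)
        insn[1 + i] = (uint8_t)((uint32_t)disp >> (8 * i));
    return 0;
}
```

For the old do_xen_version at 0xffff82d080112f9e and an assumed new function at 0xffff82d0a0000000, the displacement is 0x1feed05d and the resulting bytes are e9 5d d0 ee 1f.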

In summary this example patched the affected function to jump to the new replacement function which required:

Allocating memory for the new code to live in,
Inserting trampoline with new offset in the old function to point to the new function.
Optionally we can insert in the old function a trampoline jump to a function providing a BUG_ON to catch errant code.

The disadvantage of this is that the unconditional jump incurs a small I-cache penalty. However the simplicity of the patching and the higher chance of passing safety checks make this a worthwhile option.

This patching has a similar drawback as inline patching - the safety checks have to make sure the function is not on the stack. However since we are replacing at a higher level (a full function as opposed to various offsets within functions) the checks are simpler.

Having the patching done at predetermined instances where the stacks are not deep mostly solves this problem as well.

1.6.3 Security

With this method we can re-write the hypervisor - and as such we MUST be diligent in only allowing certain guests to perform this operation.

Furthermore with SecureBoot or tboot, we MUST also verify the signature of the payload to be certain it came from a trusted source and that its integrity is intact.

As such the hypercall MUST support an XSM policy to limit what the guest is allowed to invoke. If the system is booted with signature checking enabled, signature checking of the payload will be enforced.

1.7 Design of payload format

The payload MUST contain enough data to allow us to apply the update and also safely reverse it. As such we MUST know:

The locations in memory to be patched. This can be determined dynamically via symbols or via virtual addresses.
The new code that will be patched in.

This binary format could be constructed as a custom binary format, but that has severe disadvantages:

The format might need to be changed and we would need a mechanism to accommodate that.
It has to be platform agnostic.
It has to be easily constructed using existing tools.

As such having the payload in an ELF file is the sensible way. We would be carrying the various sets of structures (and data) in the ELF sections under different names and with definitions.

Note that every structure has padding. This is added so that the hypervisor can re-use those fields as it sees fit.

An earlier design attempted to ineptly explain the relations of the ELF sections to each other without using proper ELF mechanisms (sh_info, sh_link, data structures using Elf types, etc). This design will explain the structures and how they are used together and not dig into the ELF format - except to mention that the section names should match the structure names.

The Xen Live Patch payload is a relocatable ELF binary. A typical binary would have:

One or more .text sections.
Zero or more read-only data sections.
Zero or more data sections.
Relocations for each of these sections.

It may also have some architecture-specific sections. For example:

Alternative instructions.
Bug frames.
Exception tables.
Relocations for each of these sections.

The Xen Live Patch core code loads the payload as a standard ELF binary, relocates it and handles the architecture-specific sections as needed. This process is much like what the Linux kernel module loader does.

The payload contains at least three sections:

.livepatch.funcs - which is an array of livepatch_func structures, and/or any of:
.livepatch.hooks.{preapply,postapply,prerevert,postrevert}
.livepatch.hooks.{apply,revert}
- each of which is a pointer to a hook function.
.livepatch.xen_depends - which is an ELF Note that describes what Xen build-id the payload depends on. MUST have one.
.livepatch.depends - which is an ELF Note that describes what the payload depends on. MUST have one.
.note.gnu.build-id - the build-id of this payload. MUST have one.

1.7.1 .livepatch.funcs

The .livepatch.funcs contains an array of livepatch_func structures which describe the functions to be patched:

struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
    /* Added to livepatch payload version 2: */
    uint8_t applied;
    uint8_t _pad[7];
    livepatch_expectation_t expect;
};

The size of the structure is 104 bytes on 64-bit hypervisors. It will be 92 on 32-bit hypervisors. Version 2 of the payload adds 40 bytes to the structure size: 8 bytes for applied and _pad, and 32 bytes for expect.
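
The stated sizes can be checked at compile time. The sketch below repeats the structure definitions so that it is self-contained; it assumes LIVEPATCH_OPAQUE_SIZE is 31 (matching opaque[31]) and a compiler such as GCC or Clang that packs the three uint8_t bit-fields into a single byte. The 64-bit assertions are skipped on targets with other pointer sizes.

```c
#include <stddef.h>
#include <stdint.h>

#define LIVEPATCH_OPAQUE_SIZE 31 /* assumption: same as opaque[31] */

struct livepatch_expectation {
    uint8_t enabled : 1;
    uint8_t len : 5;
    uint8_t rsv : 2;
    uint8_t data[LIVEPATCH_OPAQUE_SIZE];
};
typedef struct livepatch_expectation livepatch_expectation_t;

struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
    /* Added to livepatch payload version 2: */
    uint8_t applied;
    uint8_t _pad[7];
    livepatch_expectation_t expect;
};

/* One byte of bit-fields plus 31 data bytes. */
_Static_assert(sizeof(livepatch_expectation_t) == 32,
               "expectation is 32 bytes");
/* On 64-bit targets the version 1 fields occupy the first 64 bytes. */
_Static_assert(sizeof(void *) != 8 ||
               offsetof(struct livepatch_func, applied) == 64,
               "version 1 part is 64 bytes");
_Static_assert(sizeof(void *) != 8 ||
               sizeof(struct livepatch_func) == 104,
               "structure is 104 bytes on 64-bit");
```

Assertions of this kind are one way to keep a payload generator and the hypervisor in agreement about the layout.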

name is the symbol name of the old function. It is only used if old_addr is zero, in which case the address is resolved from it during dynamic linking (when the hypervisor loads the payload).
old_addr is the address of the function to be patched and is filled in at payload generation time if hypervisor function address is known. If unknown, the value MUST be zero and the hypervisor will attempt to resolve the address.
new_addr can either have a non-zero value or be zero.
- If there is a non-zero value, then it is the address of the function that is replacing the old function and the address is recomputed during relocation. The value MUST be the address of the new function in the payload file.
- If the value is zero, then new_size bytes at the old_addr location are NOPed out.
old_size contains the size of the respective old_addr function in bytes. The value of old_size MUST NOT be zero.
new_size depends on what new_addr contains:
- If new_addr contains a non-zero value, then new_size has the size of the new function (which will replace the one at old_addr) in bytes.
- If the value of new_addr is zero then new_size determines how many instruction bytes to NOP (up to opaque size modulo smallest platform instruction - 1 byte x86 and 4 bytes on ARM).
version indicates version of the generated payload.
opaque MUST be zero.

The version 2 of the payload adds the following fields to the structure:

applied tracks the function’s applied/reverted state. It takes one of two values: LIVEPATCH_FUNC_NOT_APPLIED or LIVEPATCH_FUNC_APPLIED.
_pad[7] adds padding to align to 8 bytes.
expect is an optional structure containing expected to-be-replaced data (mostly for inline asm patching). The expect structure format is:

struct livepatch_expectation {
    uint8_t enabled : 1;
    uint8_t len : 5;
    uint8_t rsv : 2;
    uint8_t data[LIVEPATCH_OPAQUE_SIZE]; /* Same size as opaque[] buffer of
                                            struct livepatch_func. This is the
                                            max number of bytes to be patched */
};
typedef struct livepatch_expectation livepatch_expectation_t;

- enabled enables the expectation check for the given function. The default state is disabled.
- len specifies the number of valid bytes in the data array. 5 bits are enough to specify values up to 31 (bytes), which covers the whole array.
- rsv reserved bitfields. MUST be zero.
- data contains expected bytes of content to be replaced. Same size as opaque buffer of struct livepatch_func (max number of bytes to be patched).
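
How such an expectation might be verified before patching can be sketched as follows. This is an illustration under assumptions, not the hypervisor's implementation: check_expectation is a hypothetical name, LIVEPATCH_OPAQUE_SIZE is assumed to be 31, and treating a zero len as malformed is a choice made for this example.

```c
#include <stdint.h>
#include <string.h>

#define LIVEPATCH_OPAQUE_SIZE 31 /* assumption: same as opaque[31] */

struct livepatch_expectation {
    uint8_t enabled : 1;
    uint8_t len : 5;
    uint8_t rsv : 2;
    uint8_t data[LIVEPATCH_OPAQUE_SIZE];
};

/*
 * Hypothetical check: returns 0 if patching may proceed, -1 if the
 * expectation is malformed or the bytes at old_addr differ from the
 * expected ones. len is a 5-bit field, so it cannot exceed 31.
 */
static int check_expectation(const struct livepatch_expectation *e,
                             const void *old_addr)
{
    if (!e->enabled)
        return 0; /* check not requested */

    if (e->rsv != 0 || e->len == 0)
        return -1; /* incorrectly filled expectation */

    return memcmp(old_addr, e->data, e->len) == 0 ? 0 : -1;
}
```

A mismatch indicates that the code at old_addr is not what the payload was built against, for example because another patch has already modified it.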

The size of the livepatch_func array is determined from the ELF section size.

When applying the patch the hypervisor iterates over each livepatch_func structure and the core code inserts a trampoline at old_addr to new_addr. The new_addr is altered when the ELF payload is loaded.

When reverting a patch, the hypervisor iterates over each livepatch_func and the core code copies the data from the undo buffer (private internal copy) to old_addr.
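
The apply and revert steps for a single function can be sketched as below. The patch_state structure and the helper names are hypothetical bookkeeping invented for this example; the document only states that an undo buffer (a private internal copy) exists, not how it is stored.

```c
#include <stdint.h>
#include <string.h>

#define TRAMPOLINE_LEN 5 /* e9 opcode + 32-bit displacement */

/* Hypothetical per-function state kept by the patching core. */
struct patch_state {
    uint8_t undo[TRAMPOLINE_LEN]; /* bytes overwritten at old_addr */
    int applied;
};

/* Save the original bytes, then write the trampoline over them. */
static void apply_func(uint8_t *old_code,
                       const uint8_t tramp[TRAMPOLINE_LEN],
                       struct patch_state *st)
{
    memcpy(st->undo, old_code, TRAMPOLINE_LEN);
    memcpy(old_code, tramp, TRAMPOLINE_LEN);
    st->applied = 1;
}

/* Restore the original bytes from the undo buffer. */
static void revert_func(uint8_t *old_code, struct patch_state *st)
{
    if (!st->applied)
        return; /* nothing to undo */

    memcpy(old_code, st->undo, TRAMPOLINE_LEN);
    st->applied = 0;
}
```

Only the bytes covered by the trampoline are saved and restored; the rest of the old function is left untouched in both directions.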

The payload may optionally contain the addresses of hooks to be called right before being applied and after being reverted (while all CPUs are still in the quiescent zone). These hooks do not have access to the payload structure.

.livepatch.hooks.load - an array of function pointers.
.livepatch.hooks.unload - an array of function pointers.

The payload may also optionally contain the addresses of pre- and post- vetoing hooks to be called before (pre) or after (post) the apply and revert payload actions (while all CPUs are already released from the quiescent zone). These hooks do have access to the payload structure. The pre-apply hook can prevent the payload from being loaded if the condition encoded in it is not met. Likewise, the pre-revert hook can prevent the livepatch from being unloaded if the condition encoded in it is not met.

.livepatch.hooks.{preapply,postapply}
.livepatch.hooks.{prerevert,postrevert}
- each of which is a pointer to a single hook function.

Finally, it optionally may also contain the address of apply or revert action hooks to be called instead of the default apply and revert payload actions (while all CPUs are kept in quiescent zone). These hooks do have access to payload structure.

.livepatch.hooks.{apply,revert}
- each of which is a pointer to a single hook function.
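
The interplay of the three kinds of hooks during an apply can be sketched as follows. Everything in this example is hypothetical: the payload stand-in, the hook signatures, the sample hooks and run_apply are invented for illustration, and the sketch assumes that the post hook only runs after a successful action.

```c
#include <stddef.h>

/* Hypothetical stand-in for the payload structure. */
struct payload {
    int veto;  /* makes the sample pre hook refuse the operation */
    int trace; /* records which hooks ran, one decimal digit each */
};

/* Assumed hook signatures, not taken from this document. */
typedef int (*pre_hook_t)(struct payload *);
typedef int (*action_hook_t)(const struct payload *);
typedef void (*post_hook_t)(struct payload *);

struct hooks {
    pre_hook_t pre;       /* may veto the operation */
    action_hook_t action; /* replaces the default action if set */
    post_hook_t post;     /* runs after the action */
};

static int sample_pre(struct payload *p)
{
    p->trace = p->trace * 10 + 1;
    return p->veto ? -1 : 0;
}

static int sample_default_action(const struct payload *p)
{
    (void)p; /* the real default would insert the trampolines */
    return 0;
}

static void sample_post(struct payload *p)
{
    p->trace = p->trace * 10 + 3;
}

/* Run pre hook, then the action (hook or default), then post hook. */
static int run_apply(struct payload *p, const struct hooks *h,
                     action_hook_t default_action)
{
    int rc;

    if (h->pre && (rc = h->pre(p)) != 0)
        return rc; /* vetoed: nothing was applied */

    rc = (h->action ? h->action : default_action)(p);
    if (rc == 0 && h->post)
        h->post(p);
    return rc;
}
```

A revert would follow the same shape with the prerevert, revert and postrevert hooks.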

1.7.2 Example of .livepatch.funcs

A simple example of what a payload file can be:

/* MUST be in sync with hypervisor. */
struct livepatch_func {
    const char *name;
    void *new_addr;
    void *old_addr;
    uint32_t new_size;
    uint32_t old_size;
    uint8_t version;
    uint8_t opaque[31];
    /* Added to livepatch payload version 2: */
    uint8_t applied;
    uint8_t _pad[7];
    livepatch_expectation_t expect;
};

/* Our replacement function for xen_extra_version. */
const char *xen_hello_world(void)
{
    return "Hello World";
}

static const char patch_this_fnc[] = "xen_extra_version";

/* The section attribute has to precede the initializer. */
struct livepatch_func livepatch_hello_world
    __attribute__((__section__(".livepatch.funcs"))) = {
    .version = LIVEPATCH_PAYLOAD_VERSION,
    .name = patch_this_fnc,
    .new_addr = xen_hello_world,
    .old_addr = (void *)0xffff82d08013963c, /* Extracted from xen-syms. */
    .new_size = 13, /* To be computed by scripts. */
    .old_size = 13, /* -----------""---------------  */
    /* Added to livepatch payload version 2: */
    .expect = { /* All fields to be filled manually */
        .enabled = 1,
        .len = 5,
        .rsv = 0,
        .data = { 0x48, 0x8d, 0x05, 0x33, 0x1C }
    },
};

Code must be compiled with -fPIC.

1.7.3 Hooks

1.7.3.1 .livepatch.hooks.load and .livepatch.hooks.unload

This section contains an array of function pointers to be executed before the payload is applied (.livepatch.funcs) or after the payload is reverted. This is useful to prepare data structures that need to be modified when patching.