Revision 3
The domain save image is the context of a running domain used for snapshots of a domain or for transferring domains between hosts during migration.
There are a number of problems with the format of the domain save image used in Xen 4.4 and earlier (the legacy format).
Dependent on toolstack word size. A number of fields within the image are native types such as unsigned long, which have different sizes between 32-bit and 64-bit toolstacks. This prevents domains from being migrated between hosts running 32-bit and 64-bit toolstacks.
There is no header identifying the image.
The image has no version information.
A new format that addresses the above is required.
ARM does not yet have a domain save image format specified, and the format described in this specification should be suitable.
The following features are not yet fully specified and will be included in a future draft.
Page data compression.
ARM
The image format consists of two main sections:
There are two headers: the image header, and the domain header. The image header describes the format of the image (version etc.). The domain header contains general information about the domain (architecture, type etc.).
The main part of the format is a sequence of different records. Each record type contains information about the domain context. At a minimum there is an END record marking the end of the records section.
All the fields within the headers and records have a fixed width.
Fields are always aligned to their size.
Padding and reserved fields are set to zero on save and must be ignored during restore.
Integer (numeric) fields in the image header are always in big-endian byte order.
Integer fields in the domain header and in the records are in the endianness described in the image header (which will typically be the native ordering).
The image header identifies an image as a Xen domain save image. It includes the version of this specification that the image complies with.
Tools supporting version V of the specification shall always save images using version V. Tools shall support restoring from version V. If the previous Xen release produced version V - 1 images, tools shall support restoring from these. Tools may additionally support restoring from earlier versions.
The marker field can be used to distinguish between legacy images and those corresponding to this specification. Legacy images will have one or more zero bits within the first 8 octets of the image.
Fields within the image header are always in big-endian byte order, regardless of the setting of the endianness bit.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| marker |
+-----------------------+-------------------------+
| id | version |
+-----------+-----------+-------------------------+
| options | (reserved) |
+-----------+-------------------------------------+
Field | Description |
---|---|
marker | 0xFFFFFFFFFFFFFFFF. |
id | 0x58454E46 (“XENF” in ASCII). |
version | 0x00000003. The version of this specification. |
options | bit 0: Endianness. 0 = little-endian, 1 = big-endian. |
bit 1-15: Reserved. |
The endianness shall be 0 (little-endian) for images generated on an i386, x86_64, or arm host.
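The image header layout above can be read with a fixed big-endian format string. The following is a minimal sketch, not the libxc implementation; the names `IMAGE_HEADER` and `parse_image_header` are hypothetical and chosen only for illustration.

```python
import struct

# Layout from the image header diagram: marker (8 octets), id (4),
# version (4), options (2), reserved (6). Always big-endian.
IMAGE_HEADER = struct.Struct(">QIIH6x")

MARKER = 0xFFFFFFFFFFFFFFFF
XENF_ID = 0x58454E46  # "XENF" in ASCII


def parse_image_header(buf):
    """Return (version, byte_order); byte_order is '<' or '>' for struct."""
    if len(buf) < IMAGE_HEADER.size:
        raise ValueError("truncated image header")
    marker, ident, version, options = IMAGE_HEADER.unpack_from(buf)
    if marker != MARKER:
        raise ValueError("marker mismatch (possibly a legacy image)")
    if ident != XENF_ID:
        raise ValueError("bad id field")
    # Bit 0 of options selects the byte order of everything after this header.
    byte_order = ">" if options & 1 else "<"
    return version, byte_order
```

The returned byte order character can be prefixed to later `struct` format strings when decoding the domain header and the records.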
The domain header includes general properties of the domain.
0 1 2 3 4 5 6 7 octet
+-----------------------+-----------+-------------+
| type | page_shift| (reserved) |
+-----------------------+-----------+-------------+
| xen_major | xen_minor |
+-----------------------+-------------------------+
Field | Description |
---|---|
type | 0x0000: Reserved. |
0x0001: x86 PV. | |
0x0002: x86 HVM. | |
0x0003 - 0xFFFFFFFF: Reserved. | |
page_shift | Size of a guest page as a power of two. |
i.e., page size = 2^page_shift. | |
xen_major | The Xen major version when this image was saved. |
xen_minor | The Xen minor version when this image was saved. |
The legacy stream conversion tool writes a xen_major version of 0, and sets xen_minor to its own version.
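As an illustration of the domain header layout, the sketch below decodes the 16 octets using the byte order announced by the image header. The function name `parse_domain_header` and the returned dictionary keys are assumptions made for this example only.

```python
import struct

DOMAIN_TYPES = {0x0001: "x86 PV", 0x0002: "x86 HVM"}


def parse_domain_header(buf, byte_order="<"):
    """Decode type, page_shift, xen_major and xen_minor from 16 octets."""
    # type (4), page_shift (2), reserved (2), xen_major (4), xen_minor (4)
    dom_type, page_shift, xen_major, xen_minor = struct.unpack_from(
        byte_order + "IH2xII", buf)
    if dom_type not in DOMAIN_TYPES:
        raise ValueError("reserved domain type 0x%x" % dom_type)
    return {
        "type": DOMAIN_TYPES[dom_type],
        "page_size": 1 << page_shift,  # page size = 2^page_shift
        "xen_major": xen_major,
        "xen_minor": xen_minor,
    }
```

For example, a page_shift of 12 yields a page size of 4096 octets.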
A record has a record header, type-specific data and a trailing footer. If body_length is not a multiple of 8, the body is padded with zeroes to align the end of the record on an 8 octet boundary.
0 1 2 3 4 5 6 7 octet
+-----------------------+-------------------------+
| type | body_length |
+-----------+-----------+-------------------------+
| body... |
...
| | padding (0 to 7 octets) |
+-----------+-------------------------------------+
Field | Description |
---|---|
type | 0x00000000: END |
0x00000001: PAGE_DATA | |
0x00000002: X86_PV_INFO | |
0x00000003: X86_PV_P2M_FRAMES | |
0x00000004: X86_PV_VCPU_BASIC | |
0x00000005: X86_PV_VCPU_EXTENDED | |
0x00000006: X86_PV_VCPU_XSAVE | |
0x00000007: SHARED_INFO | |
0x00000008: X86_TSC_INFO | |
0x00000009: HVM_CONTEXT | |
0x0000000A: HVM_PARAMS | |
0x0000000B: TOOLSTACK (deprecated) | |
0x0000000C: X86_PV_VCPU_MSRS | |
0x0000000D: VERIFY | |
0x0000000E: CHECKPOINT | |
0x0000000F: CHECKPOINT_DIRTY_PFN_LIST (Secondary -> Primary) | |
0x00000010: STATIC_DATA_END | |
0x00000011: X86_CPUID_POLICY | |
0x00000012: X86_MSR_POLICY | |
0x00000013 - 0x7FFFFFFF: Reserved for future mandatory records. | |
0x80000000 - 0xFFFFFFFF: Reserved for future optional records. | |
body_length | Length in octets of the record body. |
body | Content of the record. |
padding | 0 to 7 octets of zeros to pad the whole record to a multiple of 8 octets. |
Records may be mandatory or optional. Optional records have bit 31 set in their type. Restoring an image that has an unrecognised or unsupported mandatory record must fail. The contents of optional records may be ignored during a restore.
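The record framing and the mandatory/optional rule can be sketched as a small reader. This is an illustrative sketch under the assumption that the stream is a file-like object; `read_exact`, `read_record` and `KNOWN_TYPES` are hypothetical names, not part of any Xen library.

```python
import io
import struct

OPTIONAL_BIT = 0x80000000
KNOWN_TYPES = set(range(0x00, 0x13))  # END (0x00) .. X86_MSR_POLICY (0x12)


def read_exact(stream, n):
    """Read exactly n octets or raise EOFError."""
    data = stream.read(n)
    if len(data) != n:
        raise EOFError("truncated stream")
    return data


def read_record(stream, byte_order="<"):
    """Read one record; return (type, body), or (type, None) if ignored."""
    rec_type, body_length = struct.unpack(
        byte_order + "II", read_exact(stream, 8))
    body = read_exact(stream, body_length)
    # Consume the zero padding that rounds the body up to a multiple of 8.
    read_exact(stream, -body_length % 8)
    if rec_type not in KNOWN_TYPES:
        if rec_type & OPTIONAL_BIT:
            return rec_type, None  # unknown optional record: ignore contents
        raise ValueError("unknown mandatory record 0x%08x" % rec_type)
    return rec_type, body
```

A caller would loop on `read_record` until it returns the END type (0x00000000).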
The following sub-sections specify the record body format for each of the record types.
An end record marks the end of the image, and shall be the final record in the stream.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
The end record contains no fields; its body_length is 0.
The bulk of an image consists of many PAGE_DATA records containing the memory contents.
0 1 2 3 4 5 6 7 octet
+-----------------------+-------------------------+
| count (C) | (reserved) |
+-----------------------+-------------------------+
| pfn[0] |
+-------------------------------------------------+
...
+-------------------------------------------------+
| pfn[C-1] |
+-------------------------------------------------+
| page_data[0]... |
...
+-------------------------------------------------+
| page_data[N-1]... |
...
+-------------------------------------------------+
Field | Description |
---|---|
count | Number of pages described in this record. |
pfn | An array of count PFNs and their types. |
Bit 63-60: XEN_DOMCTL_PFINFO_* type (from public/domctl.h but shifted by 32 bits). | |
Bit 59-52: Reserved. | |
Bit 51-0: PFN. | |
page_data | page_size octets of uncompressed page contents for each page set as present in the pfn array. |
Note: Count is strictly > 0. N is strictly <= C and it is possible for there to be no page_data in the record if all pfns are of invalid types.
PFINFO type | Value | Description |
---|---|---|
NOTAB | 0x0 | Normal page. |
L1TAB | 0x1 | L1 page table page. |
L2TAB | 0x2 | L2 page table page. |
L3TAB | 0x3 | L3 page table page. |
L4TAB | 0x4 | L4 page table page. |
 | 0x5-0x8 | Reserved. |
L1TAB_PIN | 0x9 | L1 page table page (pinned). |
L2TAB_PIN | 0xA | L2 page table page (pinned). |
L3TAB_PIN | 0xB | L3 page table page (pinned). |
L4TAB_PIN | 0xC | L4 page table page (pinned). |
BROKEN | 0xD | Broken page. |
XALLOC | 0xE | Allocate only. |
XTAB | 0xF | Invalid page. |
PFNs with type BROKEN, XALLOC, or XTAB do not have any corresponding page_data.
The saver uses the XTAB type for PFNs that become invalid in the guest’s P2M table during a live migration[1].
Restoring an image with unrecognised page types shall fail.
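To make the PAGE_DATA rules concrete, the sketch below splits each pfn entry into its type and PFN bits, and pairs page contents only with entries whose type carries data. It assumes the page size has already been taken from the domain header; `parse_page_data` and the constant names are hypothetical.

```python
import struct

NO_DATA_TYPES = {0xD, 0xE, 0xF}        # BROKEN, XALLOC, XTAB
RESERVED_TYPES = {0x5, 0x6, 0x7, 0x8}  # unrecognised: restore must fail
PFN_MASK = (1 << 52) - 1               # bits 51-0


def parse_page_data(body, page_size=4096, byte_order="<"):
    """Return a list of (pfn, type, page_bytes_or_None) tuples."""
    (count,) = struct.unpack_from(byte_order + "I4x", body)
    if count == 0:
        raise ValueError("count must be > 0")
    entries = struct.unpack_from("%s%dQ" % (byte_order, count), body, 8)
    offset = 8 + 8 * count  # page_data starts after the pfn array
    pages = []
    for entry in entries:
        pfn, ptype = entry & PFN_MASK, entry >> 60
        if ptype in RESERVED_TYPES:
            raise ValueError("unrecognised page type 0x%x" % ptype)
        data = None
        if ptype not in NO_DATA_TYPES:
            data = body[offset:offset + page_size]
            offset += page_size
        pages.append((pfn, ptype, data))
    if offset != len(body):
        raise ValueError("body length does not match the pfn array")
    return pages
```

With this layout, N (the number of page_data entries) is simply the number of pfn entries whose type is not BROKEN, XALLOC or XTAB.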
0 1 2 3 4 5 6 7 octet
+-----+-----+-----------+-------------------------+
| w | ptl | (reserved) |
+-----+-----+-----------+-------------------------+
Field | Description |
---|---|
guest_width (w) | Guest width in octets (either 4 or 8). |
pt_levels (ptl) | Number of page table levels (either 3 or 4). |
0 1 2 3 4 5 6 7 octet
+-----+-----+-----+-----+-------------------------+
| p2m_start_pfn (S) | p2m_end_pfn (E) |
+-----+-----+-----+-----+-------------------------+
| p2m_pfn[p2m frame containing pfn S] |
+-------------------------------------------------+
...
+-------------------------------------------------+
| p2m_pfn[p2m frame containing pfn E] |
+-------------------------------------------------+
Field | Description |
---|---|
p2m_start_pfn | First pfn index in the p2m_pfn array. |
p2m_end_pfn | Last pfn index in the p2m_pfn array. |
p2m_pfn | Array of PFNs containing the guest’s P2M table, for the PFN frames containing the PFN range S to E (inclusive). |
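The number of p2m_pfn entries follows from the range S to E. The sketch below is an assumption-laden illustration: it assumes each P2M entry is guest_width octets wide (from X86_PV_INFO), so that one frame holds page_size / guest_width entries. The function name `p2m_frame_count` is hypothetical.

```python
def p2m_frame_count(start_pfn, end_pfn, guest_width=8, page_size=4096):
    """Number of P2M frames covering PFNs start_pfn..end_pfn inclusive.

    Assumption: each P2M entry occupies guest_width octets.
    """
    if start_pfn > end_pfn:
        raise ValueError("p2m_start_pfn must not exceed p2m_end_pfn")
    per_frame = page_size // guest_width  # entries held by one frame
    return end_pfn // per_frame - start_pfn // per_frame + 1
```

Under these assumptions the expected body length of the record is 8 + 8 * p2m_frame_count(S, E) octets, which a restorer could use as a sanity check.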
The formats of these records are identical. They are all binary blobs of data which are accessed using specific pairs of domctl hypercalls.
0 1 2 3 4 5 6 7 octet
+-----------------------+-------------------------+
| vcpu_id | (reserved) |
+-----------------------+-------------------------+
| context... |
...
+-------------------------------------------------+
Field | Description |
---|---|
vcpu_id | The VCPU ID. |
context | Binary data for this VCPU. |
Record type | Accessor hypercalls |
---|---|
X86_PV_VCPU_BASIC | XEN_DOMCTL_{get,set}vcpucontext |
X86_PV_VCPU_EXTENDED | XEN_DOMCTL_{get,set}_ext_vcpucontext |
X86_PV_VCPU_XSAVE | XEN_DOMCTL_{get,set}vcpuextstate |
X86_PV_VCPU_MSRS | XEN_DOMCTL_{get,set}_vcpu_msrs |
The content of the Shared Info page.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| shared_info |
...
+-------------------------------------------------+
Field | Description |
---|---|
shared_info | Contents of the shared info page. This record should be exactly 1 page long. |
Domain TSC information, as accessed by the XEN_DOMCTL_{get,set}tscinfo hypercall sub-ops.
0 1 2 3 4 5 6 7 octet
+------------------------+------------------------+
| mode | khz |
+------------------------+------------------------+
| nsec |
+------------------------+------------------------+
| incarnation | (reserved) |
+------------------------+------------------------+
Field | Description |
---|---|
mode | TSC mode, TSC_MODE_* constant. |
khz | TSC frequency, in kHz. |
nsec | Elapsed time, in nanoseconds. |
incarnation | Incarnation. |
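The X86_TSC_INFO body is a fixed 24 octets and can be decoded in one step. A minimal sketch, with the hypothetical names `TscInfo` and `parse_tsc_info`:

```python
import struct
from collections import namedtuple

TscInfo = namedtuple("TscInfo", "mode khz nsec incarnation")


def parse_tsc_info(body, byte_order="<"):
    """Decode mode (4), khz (4), nsec (8), incarnation (4), reserved (4)."""
    fmt = struct.Struct(byte_order + "IIQI4x")
    if len(body) != fmt.size:
        raise ValueError("X86_TSC_INFO body must be %d octets" % fmt.size)
    return TscInfo(*fmt.unpack(body))
```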
HVM Domain context, as accessed by the XEN_DOMCTL_{get,set}hvmcontext hypercall sub-ops.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| hvm_ctx |
...
+-------------------------------------------------+
Field | Description |
---|---|
hvm_ctx | The HVM Context blob from Xen. |
HVM Domain parameters, as accessed by the HVMOP_{get,set}_param hypercall sub-ops.
0 1 2 3 4 5 6 7 octet
+------------------------+------------------------+
| count (C) | (reserved) |
+------------------------+------------------------+
| param[0].index |
+-------------------------------------------------+
| param[0].value |
+-------------------------------------------------+
...
+-------------------------------------------------+
| param[C-1].index |
+-------------------------------------------------+
| param[C-1].value |
+-------------------------------------------------+
Field | Description |
---|---|
count | The number of parameters contained in this record. Each parameter in the record contains an index and value. |
param index | Parameter index. |
param value | Parameter value. |
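The HVM_PARAMS body is a count followed by that many 64-bit index/value pairs. The following sketch shows one way to decode it; `parse_hvm_params` is a hypothetical helper name.

```python
import struct


def parse_hvm_params(body, byte_order="<"):
    """Return a list of (index, value) pairs from an HVM_PARAMS body."""
    (count,) = struct.unpack_from(byte_order + "I4x", body)
    if len(body) != 8 + 16 * count:
        raise ValueError("body length does not match count")
    # Entries are stored as index, value, index, value, ...
    flat = struct.unpack_from("%s%dQ" % (byte_order, 2 * count), body, 8)
    return list(zip(flat[0::2], flat[1::2]))
```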
This record was only present for transitionary purposes during development. It should not be used.
An opaque blob provided by and supplied to the higher layers of the toolstack (e.g., libxl) during save and restore.
0 1 2 3 4 5 6 7 octet
+------------------------+------------------------+
| data |
...
+-------------------------------------------------+
Field | Description |
---|---|
data | Blob of toolstack-specific data. |
A verify record indicates that, while all memory has now been sent, the sender shall send further memory records for debugging purposes.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
The verify record contains no fields; its body_length is 0.
A checkpoint record indicates that all the preceding records in the stream represent a consistent view of VM state.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
The checkpoint record contains no fields; its body_length is 0.
If the stream is embedded in a higher level toolstack stream, the CHECKPOINT record marks the end of the libxc portion of the stream and the stream is handed back to the higher level for further processing.
The higher level stream may then hand the stream back to libxc to process another set of records for the next consistent VM state snapshot. This next set of records may be terminated by another CHECKPOINT record or an END record.
A checkpoint dirty pfn list record is used to convey information about dirty memory in the VM. It is an unordered list of PFNs. It is currently only applicable in the backchannel of a checkpointed stream, and is only used by COLO; for more detail, see README.colo.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| pfn[0] |
+-------------------------------------------------+
...
+-------------------------------------------------+
| pfn[C-1] |
+-------------------------------------------------+
The count of pfns is: record->length/sizeof(uint64_t).
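Because this record carries no explicit count, the number of PFNs is derived from the body length. A minimal sketch, using the hypothetical name `parse_dirty_pfn_list`:

```python
import struct


def parse_dirty_pfn_list(body, byte_order="<"):
    """Return the PFNs in a CHECKPOINT_DIRTY_PFN_LIST body."""
    if len(body) % 8:
        raise ValueError("body length is not a multiple of 8 octets")
    count = len(body) // 8  # length / sizeof(uint64_t)
    return list(struct.unpack("%s%dQ" % (byte_order, count), body))
```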
A static data end record marks the end of the static state, i.e. state which does not vary with guest execution.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
The static data end record contains no fields; its body_length is 0.
CPUID policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy hypercall sub-ops.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| CPUID_policy |
...
+-------------------------------------------------+
Field | Description |
---|---|
CPUID_policy | Array of xen_cpuid_leaf_t[]’s |
MSR policy content, as accessed by the XEN_DOMCTL_{get,set}_cpu_policy hypercall sub-ops.
0 1 2 3 4 5 6 7 octet
+-------------------------------------------------+
| MSR_policy |
...
+-------------------------------------------------+
Field | Description |
---|---|
MSR_policy | Array of xen_msr_entry_t[]’s |
The set of valid records depends on the guest architecture and type. No assumptions should be made about the ordering or interleaving of independent records. Record dependencies are noted below.
Some records are used for signalling, and explicitly have zero length. All other records contain data relevant to the migration. Data records with no content should be elided on the source side, as their presence serves no purpose, but results in extra work for the restore side.
A typical save record for an x86 PV guest image would look like:
There are some strict ordering requirements. The following records must be present in the following order as each of them depends on information present in the preceding ones.
A typical save record for an x86 HVM guest image would look like:
HVM_PARAMS must precede HVM_CONTEXT, as certain parameters can affect the validity of architectural state in the context.
A v3 stream is compatible with a v2 stream, but mandates the presence of a STATIC_DATA_END record ahead of any memory/register content. This is to ease the introduction of new static configuration records over time.
A v3-compatible receiver interpreting a v2 stream should infer the position of STATIC_DATA_END based on finding the first X86_PV_P2M_FRAMES record (for PV guests), or PAGE_DATA record (for HVM guests) and behave as if STATIC_DATA_END had been sent.
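The inference rule for v2 streams can be sketched as a filter over already-decoded records. This is only an illustration, assuming records arrive as (type, body) tuples; the generator name `with_static_data_end` is hypothetical.

```python
REC_PAGE_DATA = 0x00000001
REC_X86_PV_P2M_FRAMES = 0x00000003
REC_STATIC_DATA_END = 0x00000010


def with_static_data_end(records, version, is_hvm):
    """Yield records, inserting STATIC_DATA_END into a v2 stream."""
    trigger = REC_PAGE_DATA if is_hvm else REC_X86_PV_P2M_FRAMES
    seen = version >= 3  # v3 streams carry the record explicitly
    for rec_type, body in records:
        if not seen and rec_type == trigger:
            # Behave as if STATIC_DATA_END had been sent at this point.
            yield REC_STATIC_DATA_END, b""
            seen = True
        yield rec_type, body
```

Downstream code can then treat v2 and v3 streams uniformly, handling STATIC_DATA_END in one place.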
Restoring legacy images from older tools shall be handled by translating the legacy format image into this new format.
It shall not be possible to save in the legacy format.
There are two different legacy images depending on whether they were generated by a 32-bit or a 64-bit toolstack. These shall be distinguished by inspecting octets 4-7 in the image. If these are zero then it is a 64-bit image.
Toolstack | Field | Value |
---|---|---|
64-bit | Bits 32-63 of the p2m_size field | 0 (since p2m_size < 2^32) |
32-bit | extended-info chunk ID (PV) | 0xFFFFFFFF |
32-bit | Chunk type (HVM) | < 0 |
32-bit | Page count (HVM) | > 0 |
This assumes the presence of the extended-info chunk which was introduced in Xen 3.0.
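Combining the marker rule with the octets 4-7 test gives a simple classifier over the first 8 octets of an image. This is a sketch of the decision described above, not the conversion tool's actual logic; `classify_image` and its return strings are hypothetical.

```python
def classify_image(first8):
    """Classify an image from its first 8 octets."""
    if len(first8) < 8:
        raise ValueError("need at least 8 octets")
    if first8[:8] == b"\xff" * 8:
        return "xenf"        # marker of this specification
    if first8[4:8] == b"\x00" * 4:
        return "legacy-64"   # upper half of a 64-bit p2m_size is zero
    return "legacy-32"
```

An image classified as "xenf" still needs its id and version fields validated before it is accepted.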
All changes to this specification should bump the revision number in the title block.
All changes to the image or domain headers require the image version to be increased.
The format may be extended by adding additional record types.
Extending an existing record type must be done by adding a new record type. This allows old images with the old record to still be restored.
The image header may only be extended by appending additional fields. In particular, the marker, id and version fields must never change size or location.
[1] In the legacy format, this is the list of unmapped PFNs in the tail.