Xen transport for 9pfs version 1

Background

9pfs is a network filesystem protocol developed for Plan 9. 9pfs is very simple and describes a series of commands and responses. It is completely independent of the communication channel; in fact, many clients and servers support multiple channels, usually called "transports". For example the Linux client supports tcp and unix sockets, fds, virtio and rdma.

9pfs protocol

This document won't cover the full 9pfs specification. Please refer to this [paper] and this [website] for a detailed description of it. However it is useful to know that each 9pfs request and response has the following header:

struct header {
    uint32_t size;
    uint8_t id;
    uint16_t tag;
} __attribute__((packed));

0         4  5    7
+---------+--+----+
|  size   |id|tag |
+---------+--+----+

It is possible to have multiple requests in-flight at any given time.
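
For illustration only, a client might decode this header from a byte stream as sketched below; the struct and helper names are hypothetical, and the sketch assumes the standard little-endian 9p wire encoding:

#include <stdint.h>

/* Hypothetical sketch: decode the 7-byte 9pfs header from a
 * little-endian byte buffer, without relying on struct packing or on
 * the host's endianness. */
struct p9_header {
    uint32_t size;  /* total message length, header included */
    uint8_t  id;    /* message type, e.g. Tversion/Rversion */
    uint16_t tag;   /* matches a response to its request */
};

static void p9_header_decode(const uint8_t *buf, struct p9_header *h)
{
    h->size = (uint32_t)buf[0] | ((uint32_t)buf[1] << 8) |
              ((uint32_t)buf[2] << 16) | ((uint32_t)buf[3] << 24);
    h->id   = buf[4];
    h->tag  = (uint16_t)buf[5] | ((uint16_t)buf[6] << 8);
}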

Rationale

This document describes a Xen based transport for 9pfs, in the traditional PV frontend and backend format. The PV frontend is used by the client to send commands to the server. The PV backend is used by the 9pfs server to receive commands from clients and send back responses.

The transport protocol supports multiple rings, up to the maximum supported by the backend. The size of every ring is also configurable and can span multiple pages, up to the maximum supported by the backend (although it cannot be more than 2MB). The design aims to exploit parallelism at the vCPU level and to support multiple outstanding requests simultaneously.

This document does not cover the 9pfs client/server design or implementation, only the transport for it.

Xenstore

The frontend and the backend connect via xenstore to exchange information. The toolstack creates front and back nodes with state [XenbusStateInitialising]. The protocol node name is 9pfs.

Multiple rings are supported for each frontend and backend connection.

Backend XenBus Nodes

Backend specific properties, written by the backend, read by the frontend:

versions
     Values:         <string>

     List of comma separated protocol versions supported by the backend.
     For example "1,2,3". Currently the value is just "1", as there is
     only one version. N.B.: this is the version of the Xen transport
     protocol, not the version of 9pfs supported by the server.

max-rings
     Values:         <uint32_t>

     The maximum supported number of rings per frontend.

max-ring-page-order
     Values:         <uint32_t>

     The maximum supported size of a memory allocation in units of
     log2(machine pages), e.g. 1 == 2 pages, 2 == 4 pages, etc. It
     must be at least 1.
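
For illustration, this is how a frontend might consume the feature nodes above; a hypothetical sketch, not part of the protocol:

#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical sketch: return true if version "1" appears in the comma
 * separated versions string published by the backend, e.g. "1,2,3". */
static bool backend_supports_v1(const char *versions)
{
    const char *p = versions;

    while (p && *p) {
        if (p[0] == '1' && (p[1] == '\0' || p[1] == ','))
            return true;
        p = strchr(p, ',');
        if (p)
            p++;            /* skip the comma */
    }
    return false;
}

/* Hypothetical sketch: never allocate rings larger than the backend
 * advertised via max-ring-page-order. */
static uint32_t choose_ring_order(uint32_t wanted_order,
                                  uint32_t max_ring_page_order)
{
    return wanted_order <= max_ring_page_order ? wanted_order
                                               : max_ring_page_order;
}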

Backend configuration nodes, written by the toolstack, read by the backend:

path
     Values:         <string>

     Host filesystem path to share.

tag
     Values:         <string>

     Alphanumeric tag that identifies the 9pfs share. The client needs
     to know the tag to be able to mount it.

security-model
     Values:         "none"

     *none*: files are stored using the same credentials as they are
             created on the guest (no user ownership squash or remap)
     Only "none" is supported in this version of the protocol.

Frontend XenBus Nodes

version
     Values:         <string>

     Protocol version, chosen among the ones supported by the backend
     (see **versions** under [Backend XenBus Nodes]). Currently the
     value must be "1".

num-rings
     Values:         <uint32_t>

     Number of rings. It needs to be less than or equal to max-rings.

event-channel-<num> (event-channel-0, event-channel-1, etc)
     Values:         <uint32_t>

     The identifier of the Xen event channel used to signal activity
     in the ring buffer. One for each ring.

ring-ref<num> (ring-ref0, ring-ref1, etc)
     Values:         <uint32_t>

     The Xen grant reference granting permission for the backend to
     map a page with information to set up a shared ring. One for
     each ring.
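
Similarly, a hypothetical frontend that negotiated version 1 and two rings would publish nodes along these lines (the path prefix and the numeric values are illustrative only):

/local/domain/<domid>/device/9pfs/0/version = "1"
/local/domain/<domid>/device/9pfs/0/num-rings = "2"
/local/domain/<domid>/device/9pfs/0/event-channel-0 = "10"
/local/domain/<domid>/device/9pfs/0/event-channel-1 = "11"
/local/domain/<domid>/device/9pfs/0/ring-ref0 = "512"
/local/domain/<domid>/device/9pfs/0/ring-ref1 = "513"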

State Machine

Initialization:

*Front*                               *Back*
XenbusStateInitialising               XenbusStateInitialising
- Query virtual device                - Query backend device
  properties.                           identification data.
- Setup OS device instance.           - Publish backend features
- Allocate and initialize the           and transport parameters
  request ring.                                      |
- Publish transport parameters                       |
  that will be in effect during                      V
  this connection.                            XenbusStateInitWait
             |
             |
             V
   XenbusStateInitialised

                                      - Query frontend transport parameters.
                                      - Connect to the request ring and
                                        event channel.
                                                     |
                                                     |
                                                     V
                                             XenbusStateConnected

 - Query backend device properties.
 - Finalize OS virtual device
   instance.
             |
             |
             V
    XenbusStateConnected

Once frontend and backend are connected, they share one page per ring, which is used to set up the ring, and one event channel per ring, which is used to send notifications.

Shutdown:

*Front*                            *Back*
XenbusStateConnected               XenbusStateConnected
            |
            |
            V
   XenbusStateClosing

                                   - Unmap grants
                                   - Unbind evtchns
                                             |
                                             |
                                             V
                                     XenbusStateClosing

- Unbind evtchns
- Free rings
- Free data structures
           |
           |
           V
   XenbusStateClosed

                                   - Free remaining data structures
                                             |
                                             |
                                             V
                                     XenbusStateClosed

Ring Setup

The shared page has the following layout:

typedef uint32_t XEN_9PFS_RING_IDX;

struct xen_9pfs_intf {
    XEN_9PFS_RING_IDX in_cons, in_prod;
    uint8_t pad1[56];   /* keep in and out indexes on separate cache lines */
    XEN_9PFS_RING_IDX out_cons, out_prod;
    uint8_t pad2[56];

    uint32_t ring_order;
    /* this is an array of (1 << ring_order) elements */
    grant_ref_t ref[1];
};

/* not actually C compliant (ring_order changes from ring to ring) */
struct ring_data {
    char in[((1 << ring_order) << PAGE_SHIFT) / 2];
    char out[((1 << ring_order) << PAGE_SHIFT) / 2];
};

The binary layout of struct xen_9pfs_intf follows:

0         4         8           64        68        72          128        132
+---------+---------+-----//-----+---------+---------+-----//-----+----------+
| in_cons | in_prod |  padding   |out_cons |out_prod |  padding   |ring_order|
+---------+---------+-----//-----+---------+---------+-----//-----+----------+

132       136       140     4092      4096
+---------+---------+----//---+---------+
|  ref[0] |  ref[1] |         |  ref[N] |
+---------+---------+----//---+---------+

N.B. For one page, N is at most 991 ((4096-132)/4), but given that N needs to be a power of two, the maximum N is actually 512. As 512 == (1 << 9), the maximum possible max-ring-page-order value is 9.

The binary layout of the ring buffers follows:

0         ((1<<ring_order)<<PAGE_SHIFT)/2       ((1<<ring_order)<<PAGE_SHIFT)
+------------//-------------+------------//-------------+
|            in             |           out             |
+------------//-------------+------------//-------------+
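
To make the layout concrete, here is a minimal, hypothetical sketch of how a frontend could fill in struct xen_9pfs_intf once it has allocated the data pages; grant_page_to_backend() is a placeholder for the platform's grant table interface, not a real API:

#include <stdint.h>

typedef uint32_t grant_ref_t;                           /* as in the Xen headers */
extern grant_ref_t grant_page_to_backend(void *page);   /* placeholder */

/* Hypothetical sketch of frontend-side ring setup. The page holding
 * struct xen_9pfs_intf is itself granted to the backend via the
 * ring-ref<num> xenstore node; the intf page then carries the grant
 * references of the (1 << ring_order) data pages. */
static void setup_ring(struct xen_9pfs_intf *intf, uint32_t ring_order,
                       void **data_pages)
{
    uint32_t i;

    intf->in_cons = intf->in_prod = 0;
    intf->out_cons = intf->out_prod = 0;
    intf->ring_order = ring_order;

    for (i = 0; i < (1u << ring_order); i++)
        intf->ref[i] = grant_page_to_backend(data_pages[i]);
}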

Why ring.h is not needed

Many Xen PV protocols use the macros provided by [ring.h] to manage their shared ring for communication. This protocol does not, because it actually comes with two rings: the in ring and the out ring. Each of them is mono-directional, and there is no static request size: the producer writes opaque data to the ring. In [ring.h], on the other hand, the two directions are combined in a single ring and the request size is static and well-known. In this protocol:

in  -> backend to frontend only
out -> frontend to backend only

In the case of the in ring, the frontend is the consumer, and the backend is the producer. Everything is the same but mirrored for the out ring.

The producer, the backend in this case, never reads from the in ring. In fact, the producer doesn't need any notifications unless the ring is full. This version of the protocol doesn't take advantage of this, leaving room for future optimizations.

On the other hand, the consumer always requires notifications, unless it is already actively reading from the ring. The producer can figure this out, without any additional fields in the protocol, by comparing the indexes at the beginning and at the end of its write operation. This is similar to what [ring.h] does.

Ring Usage

The in and out arrays are used as circular buffers:

0                               sizeof(array) == ((1<<ring_order)<<PAGE_SHIFT)/2
+-----------------------------------+
|to consume|    free    |to consume |
+-----------------------------------+
           ^            ^
           prod         cons

0                               sizeof(array)
+-----------------------------------+
|  free    | to consume |   free    |
+-----------------------------------+
           ^            ^
           cons         prod

The following functions are provided to read and write to an array:

/* XEN_9PFS_RING_SIZE is the size in bytes of one array:
 * ((1 << ring_order) << PAGE_SHIFT) / 2. */
#define MASK_XEN_9PFS_IDX(idx) ((idx) & (XEN_9PFS_RING_SIZE - 1))

/* Copy len bytes out of the ring at *masked_cons into h, wrapping around
 * the end of the circular buffer when necessary, then advance the masked
 * consumer index. */
static inline void xen_9pfs_read(char *buf,
        XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
        uint8_t *h, size_t len) {
    if (*masked_cons < *masked_prod) {
        memcpy(h, buf + *masked_cons, len);
    } else {
        if (len > XEN_9PFS_RING_SIZE - *masked_cons) {
            /* The data wraps: copy up to the end of the buffer, then the
             * remainder from the beginning. */
            memcpy(h, buf + *masked_cons, XEN_9PFS_RING_SIZE - *masked_cons);
            memcpy((char *)h + XEN_9PFS_RING_SIZE - *masked_cons, buf, len - (XEN_9PFS_RING_SIZE - *masked_cons));
        } else {
            memcpy(h, buf + *masked_cons, len);
        }
    }
    *masked_cons = MASK_XEN_9PFS_IDX(*masked_cons + len);
}

/* Copy len bytes from opaque into the ring at *masked_prod, wrapping around
 * the end of the circular buffer when necessary, then advance the masked
 * producer index. */
static inline void xen_9pfs_write(char *buf,
        XEN_9PFS_RING_IDX *masked_prod, XEN_9PFS_RING_IDX *masked_cons,
        uint8_t *opaque, size_t len) {
    if (*masked_prod < *masked_cons) {
        memcpy(buf + *masked_prod, opaque, len);
    } else {
        if (len > XEN_9PFS_RING_SIZE - *masked_prod) {
            /* The free space wraps: fill up to the end of the buffer, then
             * continue from the beginning. */
            memcpy(buf + *masked_prod, opaque, XEN_9PFS_RING_SIZE - *masked_prod);
            memcpy(buf, opaque + (XEN_9PFS_RING_SIZE - *masked_prod), len - (XEN_9PFS_RING_SIZE - *masked_prod));
        } else {
            memcpy(buf + *masked_prod, opaque, len);
        }
    }
    *masked_prod = MASK_XEN_9PFS_IDX(*masked_prod + len);
}

The producer (the backend for in, the frontend for out) writes to the array in the following way:

- read cons, prod from shared memory
- general memory barrier
- verify prod against local copy (consumer shouldn't change it)
- write to array at position prod up to cons, wrapping around the circular buffer when necessary
- write memory barrier
- increase prod
- notify the other end via evtchn

The consumer (the backend for out, the frontend for in) reads from the array in the following way:

- read prod, cons from shared memory
- read memory barrier
- verify cons against local copy (producer shouldn't change it)
- read from array at position cons up to prod, wrapping around the circular buffer when necessary
- general memory barrier
- increase cons
- notify the other end via evtchn

The producer takes care of writing only as many bytes as available in the buffer up to cons. The consumer takes care of reading only as many bytes as available in the buffer up to prod.
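
As a sketch only, the producer steps above could look roughly like this for the out ring, assuming the shared indexes are free-running counters that are masked only when addressing the array (as the masking macro suggests); memory_barrier(), write_memory_barrier() and notify_via_evtchn() are placeholders for the platform's primitives:

/* Hypothetical sketch of the producer workflow on the out ring (the
 * client side). Returns 0 on success, -1 if there is not enough room. */
static int send_request(struct xen_9pfs_intf *intf, char *out,
                        uint8_t *req, size_t len)
{
    XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;

    cons = intf->out_cons;
    prod = intf->out_prod;
    memory_barrier();               /* placeholder: general memory barrier */

    /* Write only as many bytes as are free in the buffer up to cons. */
    if (XEN_9PFS_RING_SIZE - (prod - cons) < len)
        return -1;

    masked_prod = MASK_XEN_9PFS_IDX(prod);
    masked_cons = MASK_XEN_9PFS_IDX(cons);
    xen_9pfs_write(out, &masked_prod, &masked_cons, req, len);

    write_memory_barrier();         /* placeholder: write memory barrier */
    intf->out_prod = prod + len;

    notify_via_evtchn();            /* placeholder: signal the event channel */
    return 0;
}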

Request/Response Workflow

The client chooses one of the available rings, then sends a request to the other end on the out array, following the producer workflow described in [Ring Usage].

The server receives the notification and reads the request, following the consumer workflow described in [Ring Usage]. The server knows how much to read because it is specified in the size field of the 9pfs header. The server processes the request and sends back a response on the in array of the same ring, following the producer workflow as usual. Thus, every request/response pair is on one ring.

The client receives a notification and reads the response from the in array. The client knows how much data to read because it is specified in the size field of the 9pfs header.
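
Correspondingly, a hypothetical sketch of the server side consuming a request from the out array: it peeks at the 9pfs header to learn the request length, reads the whole request, and only then publishes the new consumer index. read_barrier() and memory_barrier() are placeholders, under the same free-running-index assumption as above:

/* Hypothetical sketch of the consumer workflow on the out ring (the
 * server side), typically run in response to an event channel notification. */
static void receive_request(struct xen_9pfs_intf *intf, char *out,
                            uint8_t *req_buf)
{
    XEN_9PFS_RING_IDX cons, prod, masked_cons, masked_prod;
    uint8_t hdr[7];
    uint32_t size;

    cons = intf->out_cons;
    prod = intf->out_prod;
    read_barrier();                 /* placeholder: read memory barrier */

    /* Nothing to do until at least a full 9pfs header is available. */
    if (prod - cons < sizeof(hdr))
        return;

    masked_prod = MASK_XEN_9PFS_IDX(prod);

    /* Peek at the header: its size field says how much to read. */
    masked_cons = MASK_XEN_9PFS_IDX(cons);
    xen_9pfs_read(out, &masked_prod, &masked_cons, hdr, sizeof(hdr));
    size = (uint32_t)hdr[0] | ((uint32_t)hdr[1] << 8) |
           ((uint32_t)hdr[2] << 16) | ((uint32_t)hdr[3] << 24);

    /* Read the whole request, restarting from the original position. */
    masked_cons = MASK_XEN_9PFS_IDX(cons);
    xen_9pfs_read(out, &masked_prod, &masked_cons, req_buf, size);

    memory_barrier();               /* placeholder: general memory barrier */
    intf->out_cons = cons + size;   /* publish the new consumer index */

    /* Process the request, then send the response on the in array of
     * the same ring, following the producer workflow. */
}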