# NUMA-Aware Ballooning Design and Implementation #

## Rationale ##

This document describes how the ballooning driver for Linux is made
NUMA-aware, why this is important, and its intended usage on a NUMA host.

NUMA stands for Non-Uniform Memory Access, which means that a typical NUMA
host will have more than one memory controller and that accessing a specific
memory location can have a varying cost, depending on which physical CPU the
access comes from. On such an architecture, it is very important to keep all
the memory of a domain on the same physical NUMA node (or on the smallest
possible set of physical NUMA nodes), in order to get good and predictable
performance.

Ballooning is an effective means of making space when there is the need to
create a new domain. However, on a NUMA host, one may want not only to free
some memory, but also to free it on a specific node. That could be the case,
for instance, for making sure the new domain will fit in just one node.

For more information on NUMA and on how Xen deals with it, see [Xen NUMA
Introduction][numa_intro] on the Xen Wiki, and/or
[xl-numa-placement.markdown][numa_placement].

## Definitions ##

Some definitions that may come in handy for reading the rest of this
document:

* physical topology -- the hardware NUMA characteristics of the host. It
  typically comprises the number of NUMA nodes, the amount of memory each
  node has, and how many (and which) physical CPUs belong to each node.
* virtual topology -- similar information to the physical topology, but
  applying to a guest rather than to the host (hence the "virtual"). It is
  constructed at guest creation time and comprises the number of virtual
  NUMA nodes, the amount of guest memory each virtual node is entitled to,
  and how many (and which) vCPUs belong to each virtual node.

A physical node may be abbreviated as pnode, or p-node, and a virtual node
is often abbreviated as vnode, or v-node (similarly to pCPU and vCPU).

### Physical topology and virtual topology ###

It is __not__ __at__ __all__ __required__ that there is any relationship
between the physical and virtual topology, exactly as it is not at all
required that there is any relationship between physical CPUs and virtual
CPUs.

That is to say, on a host with 4 NUMA nodes, each node with 8 GiB RAM and
4 pCPUs, there can be guests with 1 vnode, 2 vnodes or 4 vnodes (it is not
clear yet whether it would make sense to allow overbooking; perhaps yes, but
only after having warned the user about it). In all cases, vnodes can have
arbitrary sizes, with respect to both the amount of memory and the number of
vCPUs in each of them. Similarly, there is neither the requirement nor the
guarantee that:

* the memory of a vnode is actually allocated on a particular pnode, or
  even that it is allocated on __only__ one pnode;
* the vCPUs belonging to a vnode are actually bound or pinned to the pCPUs
  of a particular pnode.

Of course, although it is not required, having some kind of relationship in
place between the physical and virtual topology would be beneficial for both
performance and resource usage optimization. This is really similar to what
happens with vCPU to pCPU pinning: it is not required, but if it is possible
to do that properly, performance will increase.

In this specific case, the ideal situation would be as follows:

* all the guest memory pages that constitute one vnode, say vnode #1, are
  backed by frames that, on the host, belong to the same pnode, say pnode #3
  (and that should be true for all the vnodes);
* all the guest vCPUs that belong to vnode #1 have, as far as the Xen
  scheduler is concerned, NUMA node affinity with (some of) the pCPUs
  belonging to pnode #3 (and that, again, should hold for all the vnodes).

For that reason, a mechanism for the user to specify the relationship between
virtual and physical topologies at guest creation time will be implemented, as
well as some automated logic for coming up with a sane default, in case the
user does not say anything (as happens with NUMA automatic placement).

## Current status ##

At the time of writing this document, the work to allow a virtual NUMA topology
to be specified and properly constructed within a domain is still in progress.
Patches will be released soon, both for Xen and Linux, implementing exactly
what has been described in the previous section.

NUMA-aware ballooning needs that feature to be present, or it won't work
properly, so the remainder of this document will assume that we have virtual
NUMA in place.

We will also assume that the following "services" are in place too, provided
by the Xen hypervisor via a suitable hypercall interface, and available to
both the toolstack and the ballooning driver.

### nodemask VNODE\_TO\_PNODE(int vnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the
way up to the proper toolstack layer or guest kernel), since it is only Xen
that knows both the virtual and the physical topologies.

It takes the ID of a vnode, as identified within the virtual topology of a
guest, and returns a bitmask. The bits set to 1 in it correspond to the
pnodes from which, on the host, the memory of the passed vnode comes.

The ideal situation is that such a bitmask has only 1 bit set, since having
the memory of a vnode come from more than one pnode is clearly bad for
access locality.

### nodemask PNODE\_TO\_VNODE(int pnode) ###

This service is provided by the hypervisor (and wired, if necessary, all the
way up to the proper toolstack layer or guest kernel), since it is only Xen
that knows both the virtual and the physical topologies.

It takes the ID of a pnode, as identified within the physical topology of
the host, and returns a bitmask. The bits set to 1 in it correspond to the
vnodes, in the guest, that use memory from the passed pnode.

The ideal situation is that such a bitmask has only 1 bit set, but it is
less of a problem (with respect to the case above) if there are more.
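
The two services above can be modeled, purely for illustration, with simple
bitmask operations. The following user-space sketch hard-codes the example
mapping used later in this document (vnode #0 on pnode #1, vnodes #1 and #2
on pnode #3, vnode #3 on pnode #5); the type and function names are
assumptions, not the real interface:

```c
/* Illustrative user-space model of the two services, using a fixed
 * example vnode-to-pnode mapping. The real services are provided by the
 * hypervisor; all names here are assumptions for this sketch only. */

typedef unsigned long nodemask_t;   /* bit i set == node i is in the set */

#define NR_VNODES 4
static const int vnode_backing_pnode[NR_VNODES] = { 1, 3, 3, 5 };

/* nodemask VNODE_TO_PNODE(int vnode): pnodes backing the given vnode */
nodemask_t vnode_to_pnode(int vnode)
{
    return 1UL << vnode_backing_pnode[vnode];
}

/* nodemask PNODE_TO_VNODE(int pnode): vnodes using memory of the pnode */
nodemask_t pnode_to_vnode(int pnode)
{
    nodemask_t mask = 0;

    for (int v = 0; v < NR_VNODES; v++)
        if (vnode_backing_pnode[v] == pnode)
            mask |= 1UL << v;
    return mask;
}
```

With this mapping, pnode_to_vnode(3) has bits 1 and 2 set (the not-ideal,
but acceptable, multi-vnode case), while each vnode_to_pnode() mask has
exactly one bit set (the ideal case for access locality).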

## Description of the problem ##

Let us use an example. Let's assume that guest _G_ has 2 vnodes, and that
the memory for vnode #0 and vnode #1 comes from pnode #0 and pnode #2,
respectively.

Now, the user wants to create a new guest, but the system is under high memory
pressure, so he decides to try ballooning _G_ down. He sees that pnode #2 has
the best chance of accommodating all the memory for the new guest, which would
be really good for performance, if only he can make space there. _G_ is the
only domain taking memory from pnode #2 but, as said above, not all of its
memory comes from there.

So, right now, the user has no way to specify that he wants to balloon down
_G_ in such a way that he gets as many free pages as possible from pnode #2,
rather than from pnode #0. He can ask _G_ to balloon down, but there is no
guarantee about which pnode the memory will be freed from.

The same applies to the ballooning up case, when the user, for some specific
reasons, wants to be sure that it is the memory of some (other) specific pnode
that will be used.

What is necessary is something allowing the user to specify from what pnode
he wants the free pages, and allowing the ballooning driver to make that
happen.

## The current situation ##

This section describes the details of how ballooning works right now. It is
possible to skip this if you already possess such knowledge.

Ballooning happens inside the guest, with of course some "help" from Xen.
The notification that ballooning up or down is necessary reaches the
ballooning driver in the guest via xenstore. See [here][xenstore] and
[here][xenstore_paths] for more details.

Typically, the user --via xl or via any other high level toolstack-- calls
libxl\_set\_memory\_target(), which writes the new desired memory target for
the guest in a xenstore key: ~/memory/target.  
The guest has a xenstore watch on this key and, every time it changes, work
inside the guest itself is scheduled (the body of which is in the
balloon\_process() function). It is the responsibility of this component of
the ballooning driver to read the key and take the proper action.

There are two possible situations:

* new target < current usage -- the ballooning driver maintains a list of
  free pages (similar to what the OS does for tracking free memory). Despite
  being actually free, they are hidden from the OS, which can't use them
  unless they are handed back to it by the ballooning driver. From the OS
  perspective, the ballooning driver can be seen as some sort of monster,
  eating some of the guest free memory.  
  To steal a free page from the guest OS, and add it to the list of ballooned
  pages, the driver calls alloc\_page(). From the point of view of the OS,
  the ballooning monster ate that page, decreasing the amount of free memory
  (inside the guest).  
  At this point, a XENMEM\_decrease\_reservation hypercall is issued, which
  hands the newly allocated page to Xen so that it can unmap it, turning it,
  for all intents and purposes, into free host memory.
* new target > current usage -- the ballooning driver picks one page from
  the list of ballooned pages it maintains and issues a
  XENMEM\_populate\_physmap hypercall, with the PFN of the chosen page as an
  argument.  
  Xen goes looking for some free memory on the host and, when it finds it,
  allocates a page and maps it into the guest's memory space, using the PFN
  it has been provided with.  
  At this point, the ballooning driver calls \_\_free\_reserved\_page() to
  let the guest OS know that the PFN can now be considered a free page
  (again), which also means removing the page from the ballooned pages list.

Operations like the ones described above go on within the ballooning driver,
perhaps one after the other, until the new target is reached.
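
The two cases above can be caricatured with a user-space sketch, where
counters stand in for the real page lists and hypercalls. The struct and
function names below are illustrative, not the driver's actual ones:

```c
/* User-space caricature of balloon_process(): counters stand in for the
 * real free-page list, ballooned-page list, and hypercalls. */

struct balloon_sim {
    int guest_free;     /* pages the guest OS sees as free      */
    int ballooned;      /* pages on the driver's ballooned list */
    int current_pages;  /* pages currently owned by the guest   */
};

/* Drive the balloon toward 'target' pages, one page per step, as the
 * driver does in response to a change of ~/memory/target. */
void balloon_process_sim(struct balloon_sim *b, int target)
{
    while (b->current_pages > target && b->guest_free > 0) {
        /* alloc_page() steals a free page from the OS ... */
        b->guest_free--;
        b->ballooned++;
        /* ... and XENMEM_decrease_reservation hands it to Xen */
        b->current_pages--;
    }
    while (b->current_pages < target && b->ballooned > 0) {
        /* XENMEM_populate_physmap maps a host page at a ballooned PFN ... */
        b->ballooned--;
        b->current_pages++;
        /* ... and __free_reserved_page() returns it to the guest OS */
        b->guest_free++;
    }
}
```

Note how, in both directions, the guest's free-page count and the ballooned
list move in lockstep: the balloon only ever "eats" pages that the guest OS
considered free.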

## NUMA-aware ballooning ##

The new NUMA-aware ballooning logic works as follows.

There is room, in libxl\_set\_memory\_target(), for two more parameters, in
addition to the new memory target:

* _pnid_ -- the ID of the pnode on which the user wants to try to get some
  free memory;
* _nodeexact_ -- a bool specifying whether only memory from pnode _pnid_
  should be used or, in case the new ballooning target cannot be reached
  with that alone, the user is fine with memory from other pnodes too.  
  If _nodeexact_ is true, it is possible that the new target is not reached;
  if it is false, the new target will (probably) be reached, but it is
  possible that some memory is freed on pnodes other than _pnid_.

To let the ballooning driver know about these new parameters, a new xenstore
key exists: ~/memory/target\_nid. So, for a proper NUMA-aware ballooning
operation to occur, the user should write the proper values to both keys:
~/memory/target\_nid and ~/memory/target.
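
The document does not fix a wire format for ~/memory/target\_nid; the usage
examples below write values like "{3,true}". A driver-side parser for that
assumed encoding might look like the following sketch (the function name and
format are hypothetical):

```c
#include <stdio.h>
#include <string.h>

/* Parse an assumed "{<pnid>,<true|false>}" encoding of the (hypothetical)
 * ~/memory/target_nid key into pnid and nodeexact.
 * Returns 0 on success, -1 on a malformed string. */
int parse_target_nid(const char *val, int *pnid, int *nodeexact)
{
    char flag[8];

    if (sscanf(val, "{%d,%5[a-z]}", pnid, flag) != 2)
        return -1;
    if (strcmp(flag, "true") == 0)
        *nodeexact = 1;
    else if (strcmp(flag, "false") == 0)
        *nodeexact = 0;
    else
        return -1;
    return 0;
}
```

Whatever the final format turns out to be, the driver side only needs the
two values back: the pnode ID and the exactness flag.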

So, with NUMA-aware ballooning, ballooning down and up work as follows:

* target < current usage -- first of all, the ballooning driver uses the
  PNODE\_TO\_VNODE() service (provided by the virtual topology implementation,
  as explained above) to translate _pnid_ (which it reads from xenstore) into
  the corresponding set of vnode IDs, say _{vnids}_ (which will be a
  one-element set, in case there is only one vnode in the guest allocated
  out of pnode _pnid_). It then uses alloc\_pages\_node(), with one or more
  elements from _{vnids}_ as nodes, in order to steal the proper number of
  pages from the guest OS, add them to the ballooned pages, and hand them to
  Xen (with the same hypercall as above).  
  If not enough pages can be found, and only if _nodeexact_ is false, it
  starts considering other nodes (by just calling alloc\_page()). If
  _nodeexact_ is true, it just returns.
* target > current usage -- first of all, the ballooning driver uses the
  PNODE\_TO\_VNODE() service to translate _pnid_ into the corresponding set
  of vnode IDs, say _{vnids}_. It then looks for enough ballooned pages,
  among the ones belonging to the vnodes in _{vnids}_, and asks Xen to map
  them via XENMEM\_populate\_physmap. While doing the latter, it explicitly
  tells the hypervisor to allocate the actual host pages on pnode _pnid_.  
  If not enough ballooned pages can be found among the vnodes in _{vnids}_,
  and only if _nodeexact_ is false, it falls back to looking for ballooned
  pages in other vnodes. For each one it finds, it calls VNODE\_TO\_PNODE(),
  to see which pnode it belongs to, and then asks Xen to map it, exactly as
  above.
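
The balloon-down path above can be sketched as follows, with per-vnode
free-page counters standing in for the real page lists, and the 'backing'
array standing in for what the PNODE\_TO\_VNODE()/VNODE\_TO\_PNODE()
services would report. All names are assumptions for this sketch, not the
driver's actual ones:

```c
/* User-space sketch of the NUMA-aware balloon-down logic.
 * vnode_free[v]: free pages the guest OS has on vnode v (stand-in for a
 * real page list); backing[v]: the pnode backing vnode v.
 * Tries to balloon out 'nr' pages from pnode 'pnid'; returns how many
 * pages were actually ballooned out. */
int balloon_down_numa(int vnode_free[], int nvnodes, const int backing[],
                      int nr, int pnid, int nodeexact)
{
    int done = 0;

    /* First pass: only vnodes backed by pnid (i.e., the set that
     * PNODE_TO_VNODE(pnid) would return); here the real driver would use
     * alloc_pages_node() and XENMEM_decrease_reservation. */
    for (int v = 0; v < nvnodes && done < nr; v++)
        if (backing[v] == pnid)
            while (done < nr && vnode_free[v] > 0) {
                vnode_free[v]--;
                done++;
            }

    if (done == nr || nodeexact)
        return done;   /* nodeexact: possibly short of the target */

    /* Fallback, only if !nodeexact: any vnode (plain alloc_page()) */
    for (int v = 0; v < nvnodes && done < nr; v++)
        while (done < nr && vnode_free[v] > 0) {
            vnode_free[v]--;
            done++;
        }
    return done;
}
```

With _nodeexact_ true, the function may return fewer pages than requested;
with it false, other vnodes contribute until the target (or the guest's free
memory) is exhausted, matching the two behaviors described above.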

The biggest difference between current and NUMA-aware ballooning is that the
latter needs to keep multiple lists of ballooned pages, in an array with one
element for each virtual node. This way, it is always evident, at any given
time, which ballooned pages belong to which vnode.

Regarding the "stealing a page from the OS" part, it is enough to use the
Linux function alloc\_pages\_node() in place of alloc\_page().

Finally, from the hypercall point of view, both XENMEM\_decrease\_reservation
and XENMEM\_populate\_physmap already are NUMA-aware, so it is just a matter
of passing them the proper arguments.
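
The balloon-up path, with its per-vnode ballooned-page lists, can be
sketched in the same style as the balloon-down sketch (counters stand in
for the lists; names are assumptions for illustration only):

```c
/* User-space sketch of the NUMA-aware balloon-up logic.
 * ballooned[v]: ballooned pages whose PFNs belong to vnode v (stand-in
 * for the per-vnode lists); backing[v]: the pnode backing vnode v.
 * Tries to hand 'nr' pages back to the guest, backed by pnode 'pnid';
 * returns how many pages were actually mapped. */
int balloon_up_numa(int ballooned[], int nvnodes, const int backing[],
                    int nr, int pnid, int nodeexact)
{
    int done = 0;

    /* First pass: ballooned pages of the vnodes backed by pnid; for
     * these, XENMEM_populate_physmap would be told to allocate the host
     * pages on pnid, then __free_reserved_page() returns them to the OS. */
    for (int v = 0; v < nvnodes && done < nr; v++)
        if (backing[v] == pnid)
            while (done < nr && ballooned[v] > 0) {
                ballooned[v]--;
                done++;
            }

    if (done == nr || nodeexact)
        return done;   /* nodeexact: work with what was found */

    /* Fallback, only if !nodeexact: ballooned pages from any vnode;
     * VNODE_TO_PNODE() would say where each host page should go. */
    for (int v = 0; v < nvnodes && done < nr; v++)
        while (done < nr && ballooned[v] > 0) {
            ballooned[v]--;
            done++;
        }
    return done;
}
```

This mirrors the "Ballooning up" examples below: with _nodeexact_ true and
only Y < N suitable ballooned pages found, the remaining phases simply work
with Y.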

### Usage examples ###

Let us assume we have a guest with 4 vnodes on a host with 8 pnodes. The
virtual to physical topology mapping is as follows:

* vnode #0 --> pnode #1
* vnode #1 --> pnode #3
* vnode #2 --> pnode #3
* vnode #3 --> pnode #5

Let us also assume that the user has just decided to change the current
memory allocation scheme, by ballooning (up or down) $DOMID.

Here are some usage examples of NUMA-aware ballooning, with a simplified
explanation of what happens in the various situations.

#### Ballooning down, example 1 ####

The user wants N free pages from pnode #3, and only from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen which vnodes insist on pnode #3, and gets
   {1,2}
4. ballooning driver tries to steal N pages from vnode #1 and/or vnode #2
   (both are fine) and adds them to the ballooned out pages (i.e., allocates
   them in the guest and unmaps them in Xen)
5. whether phase 4 succeeds or not, it returns, having potentially freed
   less than N pages on pnode #3

#### Ballooning down, example 2 ####

The user wants N free pages from pnode #1, but if impossible, is fine with
other pnodes to contribute to that:

1. user invokes `xl mem-set -n 1 $DOMID N`
2. libxl writes N to ~/memory/target and {1,false} to ~/memory/target\_nid
3. ballooning driver asks Xen which vnodes insist on pnode #1, and gets {0}
4. ballooning driver tries to steal N pages from vnode #0 and adds them to
   the ballooned out pages
5. if fewer than N pages are found in phase 4, the ballooning driver tries
   to steal from any vnode

#### Ballooning up, example 1 ####

The user wants to give N more pages, taking them from pnode #3, and only
from there:

1. user invokes `xl mem-set -n 3 -e $DOMID N`
2. libxl writes N to ~/memory/target and {3,true} to ~/memory/target\_nid
3. ballooning driver asks Xen which vnodes insist on pnode #3, and gets
   {1,2}
4. ballooning driver locates N ballooned pages belonging to vnode #1 or
   vnode #2 (both would do)
5. ballooning driver asks Xen to allocate N pages on pnode #3
6. ballooning driver asks Xen to map the new pages onto the ballooned pages
   and hands the result back to the guest OS

NOTICE that, since _nodeexact_ was true, if phase 4 fails to find N pages
(say it finds Y<N), then phases 5 and 6 will work with Y.

## Limitations ##

This feature represents a performance (and resource usage) optimization.
As is common in these cases, it does its best under certain conditions.
In case it is put to work under different conditions, it is possible that
its actual beneficial potential is diminished or, at worst, lost completely.
Nevertheless, it is required that everything keeps working and that
performance at least does not drop below the level it was at without the
feature itself.

In this case, the best or worst working conditions have to do with how well
the guest domain is placed on the host NUMA nodes, i.e., on how many and what
physical NUMA nodes its memory comes from. If the guest has a virtual NUMA
topology, they also have to do with the relationship between the virtual and
the physical topology.

It is possible to identify three situations:

1. multiple vnodes are backed by host memory from only one pnode (but not
   vice versa, i.e., there is no single vnode backed by host memory from two
   or more pnodes). We can call this scenario many-to-one;
2. each vnode is backed by host memory coming from one specific pnode (all
   different from each other). We can call this scenario one-to-one;
3. there are vnodes that are backed by host memory coming, at the same time,
   from multiple pnodes. We can call this scenario one-to-many.

If a guest does not have a virtual NUMA topology, it can be seen as having
only one virtual NUMA node, accommodating all the memory.

Among the three scenarios above, 1 and 2 are fine, meaning that NUMA aware
ballooning will succeed in using host memory from the correct pnodes,
guaranteeing improved performance and better resource utilization than the
current situation (if there is enough free memory in the proper place, of
course).

The third situation (one-to-many) is the only problematic one. In fact, if
we try to get several pages from pnode #1, with the guest vnode #0 having
pages on both pnode #1 and pnode #2, the ballooning driver will pick the
pages from vnode #0's pool of ballooned pages, without any chance of
knowing whether they actually are backed by host pages from pnode #1 or
from pnode #2.

However, this just means that, in this case, NUMA-aware ballooning behaves
pretty much like the current (i.e., __non__ NUMA-aware) ballooning, which is
certainly undesirable, but still acceptable. On a related note, having the
memory of a vnode split among two (or more) pnodes is a non-optimal situation
already, at least from a NUMA perspective, so it would be unrealistic to ask
the ballooning driver to do anything better than the above.

[numa_intro]: http://wiki.xen.org/wiki/Xen_NUMA_Introduction
[numa_placement]: http://xenbits.xen.org/docs/unstable/misc/xl-numa-placement.markdown
[xenstore]: http://xenbits.xen.org/docs/unstable/misc/xenstore.txt
[xenstore_paths]: http://xenbits.xen.org/docs/unstable/misc/xenstore-paths.markdown
