[title] [quotes] We have seen a lot of these competing claims about the security of virtualization. So who's right? Are containers as secure as virtual machines? Are they not as secure now, but catching up quickly? Are both virtual machines and containers so insecure that it's not really worth considering the difference? I'm going to try to equip you with the tools to make your own judgements on these kinds of questions. I'm going to talk specifically about three different free software virtualization technologies -- Xen, KVM, and Linux containers -- and compare the risk that someone will be able to break out of one VM into another, or out of one container into another. I should say that much of the legwork for this talk was done by my colleague at Citrix, George Dunlap, and the entertaining zombie analogy is his.

The first thing to talk about is "security" and "risk". When we say a system is "secure", we risk falling into the trap of thinking that security is binary: a system which is "insecure" can be broken into, and a system which is "secure" can't be. Talking about "risk" instead makes it clearer that it's a spectrum: a "secure" system is one which has a relatively low risk of being broken into.

What is the nature of this risk? Where does it come from? In our scenario we have trust domains -- virtual machines, or containers -- which have been separated by a virtualization layer: either a hypervisor or Linux containers. We're trying to evaluate the risk of someone bypassing the virtualization layer and accessing data or resources in other VMs or other containers.

The source of this kind of risk in software is vulnerabilities. A *vulnerability* is the weakness -- a bug somewhere in the interface, or in the configuration -- which an attacker who has control within one trust domain takes advantage of to do things within another trust domain which they're not allowed to do. The code or technique that an attacker actually uses to take advantage of a vulnerability is called an *exploit*. If there is a vulnerability, and the attacker knows of it, then the attacker can get into your system; if there is no vulnerability, or the attacker does not know of it, then the attacker is foiled. So this virtual break-in requires the existence of a vulnerability, and the right combination of luck and knowledge on the part of the attacker.

Vulnerabilities are key. A vulnerability in software is a mistake, either in the code itself or in the configuration. I'm going to focus mostly on software vulnerabilities in this talk, since most of the systems I'm talking about have at least /tried/ to make configuration vulnerabilities harder to introduce. Let's consider a couple of examples of vulnerabilities:

[heartbleed] Many of you have probably seen this cartoon already. This is a fairly simple vulnerability, which arises from a design error in the TLS protocol (which has two length fields in its heartbeat packets) and a programming error in OpenSSL (the dominant implementation).

The risk we're interested in is the *probability* that an attacker knows of an exploitable vulnerability. I think it might be useful to introduce an analogy.

[zombie rules] You and your motley crew are the last remnants of humanity, as far as you know. You're going from place to place, living on the remnants of the old civilization. You stay in one place until you use up all the resources in that place, then you move on.
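Hold that thought for a moment. Since I mentioned Heartbleed above, here's the shape of that mistake in code. This is a simplified sketch, not the actual OpenSSL source; the function and parameter names are mine:

    /* Simplified sketch of the shape of Heartbleed -- NOT the actual
     * OpenSSL code; names invented for illustration.  The attacker
     * sends a payload plus a *claimed* payload length, and the server
     * echoes the payload back. */
    #include <string.h>
    #include <stdint.h>

    void handle_heartbeat(const uint8_t *payload, size_t actual_len,
                          uint16_t claimed_len, uint8_t *reply)
    {
        /* The missing check:
         *     if (claimed_len > actual_len) return;   (discard packet)
         * Without it, memcpy reads up to 64K of whatever happens to
         * sit after the payload in memory -- keys, passwords, anything
         * -- and sends it back to the attacker. */
        memcpy(reply, payload, claimed_len);
    }

Two length fields, one forgotten bounds check. OK -- back to the apocalypse.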
So all you have to do is make sure that every door, window, or opening of any kind is properly closed or boarded over before nightfall, and you're perfectly safe. But if you leave a single crack open, and the zombies find it, then that's the end of the story. Now, it's not that hard to secure any given door or window. But you're just human. You're often tired, or stressed, or in a hurry. Not everyone is the best carpenter; you don't get to choose who survives with you in the zombie apocalypse. Despite your best efforts, you have occasionally woken up in the morning to find doors or windows that weren't properly secured; by luck, the zombies didn't find them.

What kind of building is your party going to be looking for as a base and night shelter? Well, every door and window is an opportunity to make a mistake. Every mistake is a roll of the dice: do the zombies find it before you do? The survival of the human race depends on making few mistakes -- and on being lucky with the mistakes you do make. For small probabilities, the risk of having a single open door or window scales linearly with the number of openings that need to be secured. So, firstly, you're going to be looking for a building with few windows. Secondly, you'll be looking for a building with small, simple doors and windows that are easy to secure. The bigger and fiddlier the doors and windows, the more difficult each one is to secure; you're not all carpenters, so the larger the window, the more risk that when trying to secure it, something will get screwed up. Finally, if possible, you'll be looking for a building that allows you to have multiple layers of protection. This is sometimes called defense-in-depth. If you can secure the fence around the building, and the building itself, then to eat you the zombies need *two* mistakes to get through. If they get through the fence, but not into the building, you can clear out the grounds during the day and repair the fence. And because the failures have to coincide, the chance of having both an unsecured fence and an unsecured door at the same time is roughly the product of the two individual chances -- much smaller than the chance of having just one.

How does this analogy apply to software? Every element of every interface is an opportunity to make a mistake. The risk we're concerned about is the risk of a vulnerability in the interface to the virtualization software. And every corner of that interface -- every argument to every function, every register and operand, every interaction of internal state -- is an opportunity for a mistake that will allow an attacker to break through that interface. This is what is called "attack surface".

And now we get to my key point. The fundamental difference between a hypervisor and a container is the interface that the virtualization layer provides. A hypervisor provides an interface similar to that of hardware. It will give you memory and pagetables, but you have to bring your own operating system. It will give you a disk with block read and write, but you have to make your own filesystem on it. It will give you a network card that can send packets, but you have to make your own TCP streams. This interface has a fair amount of opportunity to screw up, and as we'll see later, there have been a number of screw-ups. Both KVM and Xen do a lot of instruction emulation in the hypervisor, and the x86 instruction set is not really that simple. (ARM is much better; and anyway, because of ARM's better design, we don't need instruction emulation to do virtualization on ARM.) Xen has about 40 hypercalls, most of which are not usable by "normal" VMs -- but they're there.
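To make the "hardware-like interface" point concrete, here's roughly how narrow a paravirtualised disk interface can be. This is a hypothetical illustration, loosely in the spirit of a PV block device; it is not Xen's actual blkif ABI, and the struct and field names are mine:

    /* Hypothetical sketch of a hardware-like virtual disk interface,
     * loosely in the spirit of a PV block device (NOT Xen's real ABI).
     * The guest fills in requests like this on a shared ring; the host
     * validates and executes them. */
    #include <stdint.h>

    struct blk_request {
        uint64_t sector;      /* starting sector on the virtual disk */
        uint32_t nr_sectors;  /* number of sectors to transfer       */
        uint8_t  is_write;    /* 0 = read, 1 = write                 */
        uint64_t buf_ref;     /* reference to the guest's buffer     */
    };

    /* The host's entire validation burden for "disk" is roughly: check
     * sector + nr_sectors against the disk size, and check buf_ref
     * against the guest's own memory.  No filenames, no permissions,
     * no seek/stat/locking/aio corner cases -- the guest builds its
     * filesystem on top of this, inside its own trust domain. */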
But compare this to the interface you get from the Linux kernel. A kernel gives you filesystems with files, directories, seek, fstat, read, mmap, and aio. How many different kinds of sockets can you create? How many different IPC mechanisms are there? How many ways are there to read and write to a file or network stream? All of them with lots of internal state and corner cases which have to be handled correctly. There are just a *lot* more opportunities to make a mistake in the Linux system call interface; and as we'll see, a lot more mistakes are made on a regular basis as a result.

OK, that's a theory; but as we know, it's often the case that theories turn out to be wrong. Do we have any actual evidence that the Linux system call interface is less secure than a hypervisor interface? And in any case, does the difference in security matter? Maybe Linux has really good security for average people, and hypervisors have a crazy level of security for people like the NSA. Or maybe, as Theo de Raadt said, they're both so insecure that it doesn't really matter which one you pick.

George went through the CVE vulnerabilities for the last year for each of four projects: Xen, KVM, qemu, and Linux. For each one, he and I considered: if we were running a system configured to be reasonably secure, would this vulnerability have affected us, and if so, how?

[ Table ] I need to explain a bit about what I am comparing here. To make the figures comparable I had to pick a configuration and usage pattern for each of the systems. I assumed we were using an Intel x86 CPU. I assumed that we were running a general purpose operating system as the guest, and that our enemy has already gained control of the guest. I have counted only vulnerabilities which would give the attacker control of the whole system, plus the corresponding denial of service and information leak bugs. So these are the vulnerabilities which, for example, a cloud hosting provider would worry about. For other scenarios you would get different numbers. But KVM, Xen and Linux containers are all touted as approaches to solving this problem. This, along with some other more detailed assumptions, allows me to compare the number of vulnerabilities in three scenarios where each system is providing similar functionality. You can see that I have given ranges for some of the entries in the table. This is because it is sometimes difficult to tell how bad a particular vulnerability is, and whether it is exploitable in a particular configuration.

I'm going to look at some of these vulnerabilities in more detail. Let's start with one of the Xen ones. The juiciest one is XSA-87, CVE-2014-1666, from January last year. There are two hypercall sub-operations, PHYSDEVOP_prepare_msix and PHYSDEVOP_release_msix, which are supposedly for the privileged system software to do interrupt setup. However, they were missing the privilege checks. As a result, any PV guest could mess about with the interrupt routing. The effect of this might vary, but generally the host would crash. (I haven't counted this as a privilege escalation because the XSA-87 advisory says that escalation is probably not possible in this case.) You can see that this is a very simple mistake.

Now, moving across the table, let's look at one of the KVM/QEMU bugs: CVE-2014-8106, titled "cirrus: insufficient blit region checks". This was a missing range check. A guest which exploits this can write to arbitrary memory in qemu, and that means it can get qemu to execute arbitrary code.
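In sketch form, the shape of that bug is something like the following. This is a simplified illustration with invented names, not QEMU's actual cirrus code:

    /* Simplified sketch of the shape of CVE-2014-8106 -- invented
     * names, NOT QEMU's actual cirrus code.  The guest programs the
     * emulated card's blit registers, so it controls dst, pitch,
     * width and height. */
    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    void blit_fill(uint8_t *vram, size_t vram_size,
                   uint32_t dst, uint32_t pitch,
                   uint32_t width, uint32_t height)
    {
        /* The missing range check: every scan line must lie inside
         * the VRAM buffer, i.e. for each y,
         *     dst + (uint64_t)y * pitch + width <= vram_size
         * Without it, the loop below writes outside video memory --
         * that is, into qemu's own memory, at guest-chosen offsets. */
        for (uint32_t y = 0; y < height; y++)
            memset(vram + dst + (size_t)y * pitch, 0xff, width);
    }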
This is a problem for KVM because KVM uses QEMU for the VGA emulation. Again, a very simple mistake. But there's something more interesting going on here: why is there a VGA emulation, anyway? Well, KVM and QEMU provide their guest with an emulated PC. PCs have graphics cards, so there is an emulated graphics card. QEMU contains various emulated graphics cards, but this particular one -- the Cirrus emulation -- is the default, and used in most installations. Xen PV guests do not get an emulated Cirrus graphics card. This shows the advantages of a small attack surface: if you don't provide a feature, there can't be any bugs in it. It also shows the importance of selecting the best configuration. If we had set up our Xen guest as HVM, it would also have an emulated PC. That emulated PC has indeed got an emulated Cirrus graphics card, provided by QEMU -- just as with KVM. So Xen HVM setups are vulnerable to this bug. I haven't done the detailed analysis, but I would expect a Xen HVM system (when configured not to use the privilege-separated stub domain qemu) to have numbers similar to those we see for KVM+QEMU.

But what about Linux? That's what we came here for, right -- to find out the truth behind the container hype. These numbers do look bad. Let's pick the top one off the list I got from George: CVE-2014-9322, privilege escalation due to an SS fault. This is a complicated bug to do with the handling of a particular processor exception during an interrupt return instruction. It's a privilege escalation from any Linux process into the kernel, and it doesn't even need any particular syscalls to be enabled. It looks to me like this bug is due to some very complicated (not to say bizarre) rules in the way Intel CPUs handle transitions between kernel and user mode -- I confess I'm not an Intel x86 expert, so that's the best I can do as an explanation. But never mind: if I can't explain this one to you, there are plenty more.

Let's pick another that looks juicy and easier: CVE-2014-3153, "futex_requeue doesn't ensure different futexes". A futex is a "fast mutex", an inter-thread synchronisation facility provided by (and specific to) Linux. It's primarily used by the standard pthreads threading library. The kernel's implementation of one of the futex operations, which operates on two caller-specified futexes, failed to check that the two futexes were different. If they weren't, Linux would get its data structures tangled up in a way that allows for exploitation by later futex calls.

KVM and Xen are not vulnerable to this specific vulnerability, of course, because they don't provide futexes to their guests. But that's rather missing the point, because Xen and KVM /do/ provide inter-thread synchronisation facilities. They just provide very few such facilities each: in the case of Xen PV guests, there's just Xen's event channels. Linux provides (depending how you count them) many different inter-thread or inter-process synchronisation facilities. I could think of eight straight off: futexes, fcntl and flock locks, ttys, System V semaphores and message queues, pipes (named and anonymous), AF_UNIX sockets, signals, and inotify. Worse, these facilities can be exercised through a bewildering array of APIs, all of which have grown as the operating system has come to be a convenient library of useful functionality. Indeed we can see this richness -- and the corresponding risk -- in this very example vulnerability.
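Before we look at how it was fixed, here's the shape of the bug in sketch form. This is a heavily simplified illustration with invented names, not the kernel's actual futex code:

    /* Heavily simplified sketch of the shape of CVE-2014-3153 --
     * invented names, NOT the kernel's actual futex code.  The
     * operation takes two caller-specified futexes and requeues
     * waiters from one onto the other. */
    #include <stdint.h>

    /* Hypothetical helper standing in for the real requeue machinery
     * (declaration only). */
    void lock_and_requeue_waiters(uint32_t *from, uint32_t *to);

    int futex_requeue_sketch(uint32_t *uaddr1, uint32_t *uaddr2)
    {
        /* The missing check -- the two futexes must be different:
         *     if (uaddr1 == uaddr2) return -EINVAL;
         * (and, as we'll see in a moment, even that isn't quite
         * enough).  Requeueing a futex onto itself ties the kernel's
         * waiter lists into a self-referential knot that later futex
         * calls can exploit. */
        lock_and_requeue_waiters(uaddr1, uaddr2);
        return 0;
    }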
In the first submitted patch to fix the bug, the code compared the two "uaddr" values passed into the function -- that is, the user process addresses of the two futexes. But Linux supports futexes which are shared between multiple processes, in which case the same futex can have different addresses in different processes, and can indeed be mapped more than once into the same process at different addresses. So the first draft of the patch didn't properly fix the bug. This problem with the initial bugfix was itself caught and fixed during patch review. It goes to show how much more opportunity there is for error with a complicated API! (There's a small sketch of this at the end.)

Of course we need to take these numbers with a pinch of salt. They're quite small, so there will be statistical variation. And whether vulnerabilities are reported depends not only on whether they're there, but also on whether (and how hard) people are looking. Speaking for the Xen project, we're aware of a few teams' efforts in the last few years to do some fuzzing, auditing and static analysis, but there's no central tracking of this kind of activity.

[ Table with non-root column ] Finally, I need to point out that my comparison has been of three general-purpose containment approaches: they all provide, in the guest, something that looks much like a normal operating system. If you just want to run a specific application, then it is possible to restrict Linux containers. For example, when using Docker, it's recommended not to give your applications root inside the container. Then you get numbers for containers which look much more like those of a real virtualization system. Of course the guest environment has much more limited functionality -- you can't run a general purpose operating system with the full set of management tools and so on -- but if you don't need all that, then you are much better off restricting the facilities available to the guest. That reduces the attack surface. But the interface provided even to a non-root container guest is still much richer than that provided to a PV Xen VM, and that's reflected in these figures.

So, to summarise: the biggest enemy of security is complexity, and particularly API complexity at security-relevant boundaries. The smaller and simpler the API between your trusted and untrusted components, the better. Linux container systems can have excellent management and convenience benefits, but I conclude that they shouldn't be thought of as a tool with secure encapsulation as a primary benefit. If I were providing a cloud service for containerised services, I would want to run them in proper VMs (perhaps one VM per customer) to provide more dependable isolation.
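[footnote] The futex fix sketched, as promised. Again simplified and with invented names, loosely modelled on the real change in kernel/futex.c:

    /* Sketch of the futex fix -- invented names, loosely modelled on
     * the real change in kernel/futex.c. */
    #include <stdint.h>

    /* Invented stand-in: in the real kernel, key resolution maps a
     * user address to a canonical identity for the underlying futex
     * (declaration only). */
    struct futex_key { uint64_t object; uint64_t offset; };
    void get_futex_key(uint32_t *uaddr, struct futex_key *key);

    static int same_futex(const struct futex_key *a,
                          const struct futex_key *b)
    {
        return a->object == b->object && a->offset == b->offset;
    }

    int check_futexes_distinct(uint32_t *uaddr1, uint32_t *uaddr2)
    {
        /* First-draft check: only catches a futex passed twice at the
         * same address.  A *shared* futex mapped at two different
         * addresses has uaddr1 != uaddr2 but is the same futex! */
        if (uaddr1 == uaddr2)
            return -1;

        /* Reviewed fix: resolve each address to a canonical key for
         * the underlying futex object, and compare the keys instead. */
        struct futex_key key1, key2;
        get_futex_key(uaddr1, &key1);
        get_futex_key(uaddr2, &key2);
        if (same_futex(&key1, &key2))
            return -1;

        return 0;
    }

The first draft was a perfectly natural check; it took the full richness of the futex API -- shared mappings included -- to see why it wasn't enough.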