[title] [quotes] We have seen a lot of these competing claims about the security of virtualization. So who's right? Are containers as secure as virtual machines? Are they not as secure now, but catching up quickly? Are both virtual machines and containers so insecure that it's not really worth considering the difference? I'm going to try to equip you with the tools to make your own judgements on these kinds of questions. I'm going to talk specifically about three different free software virtualization technologies -- Xen, KVM, and Linux containers -- and compare the risk that someone will be able to break out of one VM into another, or out of one container into another. I should say that much of the legwork for this talk was done by my colleague at Citrix, George Dunlap, and the entertaining zombie analogy is his.

The first thing to talk about is "security" and "risk". When we say a system is "secure", we risk falling into the trap of thinking that security is binary: a system which is "insecure" can be broken into, and a system which is "secure" can't be. Talking about "risk" instead makes it clearer that it's a spectrum: a "secure" system is one which has a relatively low risk of being broken into.

What is the nature of this risk? Where does it come from? In our scenario we have trust domains -- virtual machines, or containers -- which have been separated by a virtualization layer: either a hypervisor or Linux containers. We're trying to evaluate the risk of someone bypassing the virtualization layer and accessing data or resources in other VMs or other containers.

The source of this kind of risk in software is vulnerabilities. A *vulnerability* is the weakness -- a bug somewhere in the interface, or in the configuration -- which an attacker who has control within one trust domain takes advantage of to do things within another trust domain which they're not allowed to do. The code or technique that an attacker actually uses to take advantage of a vulnerability is called an *exploit*. If there is a vulnerability, and the attacker knows of it, then the attacker can get into your system; if there is no vulnerability, or the attacker does not know of it, then the attacker is foiled. So this virtual break-in requires the existence of a vulnerability, and the right combination of luck and knowledge on the part of the attacker.

Vulnerabilities are key. A vulnerability in software is a mistake, either in the code itself or in the configuration. I'm going to focus mostly on software vulnerabilities in this talk, since most of the systems I'm talking about have at least /tried/ to make configuration vulnerabilities harder to introduce. Let's consider a couple of examples of vulnerabilities:

[heartbleed] Many of you have probably seen this cartoon already. This is a fairly simple vulnerability, which arises from a design error in the TLS protocol (which has two length fields in its heartbeat packets) and a programming error in OpenSSL (the dominant implementation).

The risk we're interested in is the *probability* that an attacker knows of an exploitable vulnerability. I think it might be useful to introduce an analogy.

[zombie rules] You and your motley crew are the last remnants of humanity, as far as you know. You're going from place to place, living on the remnants of the old civilization. You stay in one place until you use up all the resources in that place, then you move on.
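Hold that thought for a moment. Since I mentioned Heartbleed above, here's the shape of that mistake in code. This is a simplified sketch, not the actual OpenSSL source; the function and parameter names are mine:

    /* Simplified sketch of the shape of Heartbleed -- NOT the actual
     * OpenSSL code; names invented for illustration.  The attacker
     * sends a payload plus a *claimed* payload length, and the server
     * echoes the payload back. */
    #include <string.h>
    #include <stdint.h>

    void handle_heartbeat(const uint8_t *payload, size_t actual_len,
                          uint16_t claimed_len, uint8_t *reply)
    {
        /* The missing check:
         *     if (claimed_len > actual_len) return;   (discard packet)
         * Without it, memcpy reads up to 64K of whatever happens to
         * sit after the payload in memory -- keys, passwords, anything
         * -- and sends it back to the attacker. */
        memcpy(reply, payload, claimed_len);
    }

Two length fields, one forgotten bounds check. OK -- back to the apocalypse.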
So all you have to do is make sure that every door, window, or opening of any kind is properly closed or boarded over before nightfall, and you're perfectly safe. But if you leave a single crack open, and the zombies find it, then that's the end of the story. Now, it's not that hard to secure any given door or window. But you're just human. You're often tired, or stressed, or in a hurry. Not everyone is the best carpenter; you don't get to choose who survives with you in the zombie apocalypse. Despite your best efforts, you have occasionally woken up in the morning to find doors or windows that weren't properly secured; by luck, the zombies didn't find them.

What kind of building is your party going to be looking for as a base and night shelter? Well, every door and window is an opportunity to make a mistake. Every mistake is a roll of the dice: do the zombies find it before you do? The survival of the human race depends on making few mistakes -- and on being lucky with the mistakes you do make. For small probabilities, the risk of having a single open door or window scales linearly with the number of openings that need to be secured. So, firstly, you're going to be looking for a building with few windows. Secondly, you'll be looking for a building with small, simple doors and windows that are easy to secure. The bigger and fiddlier the doors and windows, the more difficult each one is to secure; you're not all carpenters, so the larger the window, the more risk that when trying to secure it, something will get screwed up. Finally, if possible, you'll be looking for a building that allows you to have multiple layers of protection. This is sometimes called defense-in-depth. If you can secure the fence around the building, and the building itself, then to eat you the zombies need *two* mistakes to get through. If they get through the fence, but not into the building, you can clear out the grounds during the day and repair the fence. And because the failures have to coincide, the chance of having both an unsecured fence and an unsecured door at the same time is roughly the product of the two individual chances -- much smaller than the chance of having just one.

How does this analogy apply to software? Every element of every interface is an opportunity to make a mistake. The risk we're concerned about is the risk of a vulnerability in the interface to the virtualization software. And every corner of that interface -- every argument to every function, every register and operand, every interaction of internal state -- is an opportunity for a mistake that will allow an attacker to break through that interface. This is what is called "attack surface".

And now we get to my key point. The fundamental difference between a hypervisor and a container is the interface that the virtualization layer provides. A hypervisor provides an interface similar to that of hardware. It will give you memory and pagetables, but you have to bring your own operating system. It will give you a disk with block read and write, but you have to make your own filesystem on it. It will give you a network card that can send packets, but you have to make your own TCP streams. This interface has a fair amount of opportunity to screw up, and as we'll see later, there have been a number of screw-ups. Both KVM and Xen do a lot of instruction emulation in the hypervisor, and the x86 instruction set is not really that simple. (ARM is much better; and anyway, because of ARM's better design, we don't need instruction emulation to do virtualization on ARM.) Xen has about 40 hypercalls, most of which are not usable by "normal" VMs -- but they're there.
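To make the "hardware-like interface" point concrete, here's roughly how narrow a paravirtualised disk interface can be. This is a hypothetical illustration, loosely in the spirit of a PV block device; it is not Xen's actual blkif ABI, and the struct and field names are mine:

    /* Hypothetical sketch of a hardware-like virtual disk interface,
     * loosely in the spirit of a PV block device (NOT Xen's real ABI).
     * The guest fills in requests like this on a shared ring; the host
     * validates and executes them. */
    #include <stdint.h>

    struct blk_request {
        uint64_t sector;      /* starting sector on the virtual disk */
        uint32_t nr_sectors;  /* number of sectors to transfer       */
        uint8_t  is_write;    /* 0 = read, 1 = write                 */
        uint64_t buf_ref;     /* reference to the guest's buffer     */
    };

    /* The host's entire validation burden for "disk" is roughly: check
     * sector + nr_sectors against the disk size, and check buf_ref
     * against the guest's own memory.  No filenames, no permissions,
     * no seek/stat/locking/aio corner cases -- the guest builds its
     * filesystem on top of this, inside its own trust domain. */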
But compare this to the interface you get from the Linux kernel. A kernel gives you filesystems with files, directories, seek, fstat, read, mmap, and aio. How many different kinds of sockets can you create? How many different IPC mechanisms are there? How many ways are there to read and write to a file or network stream? All of them with lots of internal state and corner cases which have to be handled correctly. There are just a *lot* more opportunities to make a mistake in the Linux system call interface; and as we'll see, a lot more mistakes are made on a regular basis as a result.

OK, that's a theory; but as we know, it's often the case that theories turn out to be wrong. Do we have any actual evidence that the Linux system call interface is less secure than a hypervisor interface? And in any case, does the difference in security matter? Maybe Linux has really good security for average people, and hypervisors have a crazy level of security for people like the NSA. Or maybe, as Theo de Raadt said, they're both so insecure that it doesn't really matter which one you pick.

George went through the CVE vulnerabilities for the last year for each of four projects: Xen, KVM, qemu, and Linux. For each one, he and I considered: if we were running a system configured to be reasonably secure, would this vulnerability have affected us, and if so, how?

[ Table ] I need to explain a bit about what I am comparing here. To make the figures comparable I had to pick a configuration and usage pattern for each of the systems. I assumed we were using an Intel x86 CPU. I assumed that we were running a general purpose operating system as the guest, and that our enemy has already gained control of the guest. I have counted only vulnerabilities which would give the attacker control of the whole system, plus the corresponding denial of service and information leak bugs. So these are the vulnerabilities which, for example, a cloud hosting provider would worry about. For other scenarios you would get different numbers. But KVM, Xen and Linux containers are all touted as approaches to solving this problem. This, along with some other more detailed assumptions, allows me to compare the number of vulnerabilities in three scenarios where each system is providing similar functionality. You can see that I have given ranges for some of the entries in the table. This is because it is sometimes difficult to tell how bad a particular vulnerability is, and whether it is exploitable in a particular configuration.

I'm going to look at some of these vulnerabilities in more detail. Let's start with one of the Xen ones. The juiciest one is XSA-87, CVE-2014-1666, from January last year. There are two hypercall sub-operations, PHYSDEVOP_prepare_msix and PHYSDEVOP_release_msix, which are supposedly for the privileged system software to do interrupt setup. However, they were missing the privilege checks. As a result, any PV guest could mess about with the interrupt routing. The effect of this might vary, but generally the host would crash. (I haven't counted this as a privilege escalation because the XSA-87 advisory says that escalation is probably not possible in this case.) You can see that this is a very simple mistake.

Now, moving across the table, let's look at one of the KVM/QEMU bugs: CVE-2014-8106, titled "cirrus: insufficient blit region checks". This was a missing range check. A guest which exploits this can write to arbitrary memory in qemu, and that means it can get qemu to execute arbitrary code.
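In sketch form, the shape of that bug is something like the following. This is a simplified illustration with invented names, not QEMU's actual cirrus code:

    /* Simplified sketch of the shape of CVE-2014-8106 -- invented
     * names, NOT QEMU's actual cirrus code.  The guest programs the
     * emulated card's blit registers, so it controls dst, pitch,
     * width and height. */
    #include <string.h>
    #include <stdint.h>
    #include <stddef.h>

    void blit_fill(uint8_t *vram, size_t vram_size,
                   uint32_t dst, uint32_t pitch,
                   uint32_t width, uint32_t height)
    {
        /* The missing range check: every scan line must lie inside
         * the VRAM buffer, i.e. for each y,
         *     dst + (uint64_t)y * pitch + width <= vram_size
         * Without it, the loop below writes outside video memory --
         * that is, into qemu's own memory, at guest-chosen offsets. */
        for (uint32_t y = 0; y < height; y++)
            memset(vram + dst + (size_t)y * pitch, 0xff, width);
    }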
This is a problem for KVM because KVM uses QEMU for the VGA emulation. Again, a very simple mistake. But there's something more interesting going on here: why is there a VGA emulation, anyway? Well, KVM and QEMU provide their guest with an emulated PC. PCs have graphics cards, so there is an emulated graphics card. QEMU contains various emulated graphics cards, but this particular one -- the Cirrus emulation -- is the default, and used in most installations. Xen PV guests do not get an emulated Cirrus graphics card. This shows the advantages of a small attack surface: if you don't provide a feature, there can't be any bugs in it. It also shows the importance of selecting the best configuration. If we had set up our Xen guest as HVM, it would also have an emulated PC. That emulated PC has indeed got an emulated Cirrus graphics card, provided by QEMU -- just as with KVM. So Xen HVM setups are vulnerable to this bug. I haven't done the detailed analysis, but I would expect a Xen HVM system (when configured not to use the privilege-separated stub domain qemu) to have numbers similar to those we see for KVM+QEMU.

But what about Linux? That's what we came here for, right -- to find out the truth behind the container hype. These numbers do look bad. Let's pick the top one off the list I got from George: CVE-2014-9322, privilege escalation due to an SS fault. This is a complicated bug to do with the handling of a particular processor exception during an interrupt return instruction. It's a privilege escalation from any Linux process into the kernel, and it doesn't even need any particular syscalls to be enabled. It looks to me like this bug is due to some very complicated (not to say bizarre) rules in the way Intel CPUs handle transitions between kernel and user mode -- I confess I'm not an Intel x86 expert, so that's the best I can do as an explanation. But never mind: if I can't explain this one to you, there are plenty more.

Let's pick another that looks juicy and easier: CVE-2014-3153, "futex_requeue doesn't ensure different futexes". A futex is a "fast mutex", an inter-thread synchronisation facility provided by (and specific to) Linux. It's primarily used by the standard pthreads threading library. The kernel's implementation of one of the futex operations, which operates on two caller-specified futexes, failed to check that the two futexes were different. If they weren't, Linux would get its data structures tangled up in a way that allows for exploitation by later futex calls.

KVM and Xen are not vulnerable to this specific vulnerability, of course, because they don't provide futexes to their guests. But that's rather missing the point, because Xen and KVM /do/ provide inter-thread synchronisation facilities. They just provide very few such facilities each: in the case of Xen PV guests, there's just Xen's event channels. Linux provides (depending how you count them) many different inter-thread or inter-process synchronisation facilities. I could think of eight straight off: futexes, fcntl and flock locks, ttys, System V semaphores and message queues, pipes (named and anonymous), AF_UNIX sockets, signals, and inotify. Worse, these facilities can be exercised through a bewildering array of APIs, all of which have grown as the operating system has come to be a convenient library of useful functionality. Indeed we can see this richness -- and the corresponding risk -- in this very example vulnerability.
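Before we look at how it was fixed, here's the shape of the bug in sketch form. This is a heavily simplified illustration with invented names, not the kernel's actual futex code:

    /* Heavily simplified sketch of the shape of CVE-2014-3153 --
     * invented names, NOT the kernel's actual futex code.  The
     * operation takes two caller-specified futexes and requeues
     * waiters from one onto the other. */
    #include <stdint.h>

    /* Hypothetical helper standing in for the real requeue machinery
     * (declaration only). */
    void lock_and_requeue_waiters(uint32_t *from, uint32_t *to);

    int futex_requeue_sketch(uint32_t *uaddr1, uint32_t *uaddr2)
    {
        /* The missing check -- the two futexes must be different:
         *     if (uaddr1 == uaddr2) return -EINVAL;
         * (and, as we'll see in a moment, even that isn't quite
         * enough).  Requeueing a futex onto itself ties the kernel's
         * waiter lists into a self-referential knot that later futex
         * calls can exploit. */
        lock_and_requeue_waiters(uaddr1, uaddr2);
        return 0;
    }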
In the first submitted patch to fix the bug, the code compared the two "uaddr" values passed into the function -- that is, the user process addresses of the two futexes. But Linux supports futexes which are shared between multiple processes, in which case the same futex can have different addresses in different processes, and can indeed be mapped more than once into the same process at different addresses. So the first draft of the patch didn't properly fix the bug. This problem with the initial bugfix was itself caught and fixed during patch review. It goes to show how much more opportunity there is for error with a complicated API! (There's a small sketch of this at the end.)

Of course we need to take these numbers with a pinch of salt. They're quite small, so there will be statistical variation. And whether vulnerabilities are reported depends not only on whether they're there, but also on whether (and how hard) people are looking. Speaking for the Xen project, we're aware of a few teams' efforts in the last few years to do some fuzzing, auditing and static analysis, but there's no central tracking of this kind of activity.

[ Table with non-root column ] Finally, I need to point out that my comparison has been of three general-purpose containment approaches: they all provide, in the guest, something that looks much like a normal operating system. If you just want to run a specific application, then it is possible to restrict Linux containers. For example, when using Docker, it's recommended not to give your applications root inside the container. Then you get numbers for containers which look much more like those of a real virtualization system. Of course the guest environment has much more limited functionality -- you can't run a general purpose operating system with the full set of management tools and so on -- but if you don't need all that, then you are much better off restricting the facilities available to the guest. That reduces the attack surface. But the interface provided even to a non-root container guest is still much richer than that provided to a PV Xen VM, and that's reflected in these figures.

So, to summarise: the biggest enemy of security is complexity, and particularly API complexity at security-relevant boundaries. The smaller and simpler the API between your trusted and untrusted components, the better. Linux container systems can have excellent management and convenience benefits, but I conclude that they shouldn't be thought of as a tool with secure encapsulation as a primary benefit. If I were providing a cloud service for containerised services, I would want to run them in proper VMs (perhaps one VM per customer) to provide more dependable isolation.
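[footnote] The futex fix sketched, as promised. Again simplified and with invented names, loosely modelled on the real change in kernel/futex.c:

    /* Sketch of the futex fix -- invented names, loosely modelled on
     * the real change in kernel/futex.c. */
    #include <stdint.h>

    /* Invented stand-in: in the real kernel, key resolution maps a
     * user address to a canonical identity for the underlying futex
     * (declaration only). */
    struct futex_key { uint64_t object; uint64_t offset; };
    void get_futex_key(uint32_t *uaddr, struct futex_key *key);

    static int same_futex(const struct futex_key *a,
                          const struct futex_key *b)
    {
        return a->object == b->object && a->offset == b->offset;
    }

    int check_futexes_distinct(uint32_t *uaddr1, uint32_t *uaddr2)
    {
        /* First-draft check: only catches a futex passed twice at the
         * same address.  A *shared* futex mapped at two different
         * addresses has uaddr1 != uaddr2 but is the same futex! */
        if (uaddr1 == uaddr2)
            return -1;

        /* Reviewed fix: resolve each address to a canonical key for
         * the underlying futex object, and compare the keys instead. */
        struct futex_key key1, key2;
        get_futex_key(uaddr1, &key1);
        get_futex_key(uaddr2, &key2);
        if (same_futex(&key1, &key2))
            return -1;

        return 0;
    }

The first draft was a perfectly natural check; it took the full richness of the futex API -- shared mappings included -- to see why it wasn't enough.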