I wrote recently about the keynote from the Linley processor conference about network virtualization, and an analogy with server virtualization. One thing I know from when I worked at VaST and Virtutech is that the whole idea of virtualization is poorly understood, so it is worth learning a little about how it works.

The first virtualized operating system was called simply VM/370 (because it ran on the IBM 370 family of mainframes). It could run a number of other operating systems, what we would now call guest operating systems, on top of VM, which was what we would now call a hypervisor, although I don't think that word was used back then. And by "back then" I mean 1972. Computers looked like the ones in old spy movies, with spinning magnetic tape drives and chattering line printers.

The first thing to understand is how a "normal" IBM 370 operating system works. There is user code, the programs the user is running. But user code doesn't directly interact with devices, for example. It doesn't write into the device registers, wait for an interrupt, and handle whatever then needs to be done. A "normal" operating system doesn't allow that, because it would risk screwing up other users. For example, if one application program turned off all interrupts and never turned them on again, the whole machine would freeze.

What happens if a user program tries to do something like that? It can't. The instruction set of the IBM 360 (and of pretty much anything beyond a microcontroller) has two parts: a part for user programs, and a part for the operating system known as the privileged instruction set. When a user program is running and the processor is in user mode, any attempt to run a privileged instruction causes an exception, and the operating system gets to take over and, typically, terminate (abend in IBM speak) the user program.

So how does the processor ever get into privileged mode? There is one instruction, SVC, which stands for supervisor call, that is allowed in user programs. It puts the processor into privileged mode, but at the same time it transfers control to the operating system, which can look at the argument to the SVC (typically the address of a data structure describing what is being requested). The operating system then arranges for whatever was requested (read a block from a tape drive, say) to be done and then passes control back to the user program, at the same time putting the processor back into user mode. There are more details, but that's the rough idea.

Under VM, the hypervisor has a number of operating systems to run, each running a number of programs (in user mode). So how do the guest operating systems run? They need to execute privileged instructions. And how does one guest operating system avoid interfering with another, perhaps by trying to read a block off the same magnetic tape drive at the same time? After all, neither guest operating system knows the other exists; they are just binaries that could run on the bare hardware.

One way to implement this would be to interpret the machine instructions one by one. But that would be an order of magnitude or two slower than the actual hardware was capable of, so a datacenter with a lot of virtualized servers would need ten (or a hundred) times as many machines. Not a good solution.

The trick is that the guest operating systems don't get to run in privileged mode. Only the hypervisor does. When a guest operating system tries to execute a privileged instruction, it causes an exception and the hypervisor gets to take over.
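To make this concrete, here is a heavily simplified sketch, in C, of what such a trap handler conceptually does. It is only an illustration of the trap-and-emulate idea, not real VM/370 (or any other hypervisor's) code; the names (struct vcpu, handle_privileged_op_trap, emulate_io_request and so on) are all invented.

    #include <stdbool.h>
    #include <stdint.h>

    /* Made-up state for one virtual CPU. */
    struct vcpu {
        uint64_t pc;                        /* guest program counter */
        bool     guest_in_supervisor_state; /* the mode the guest *thinks* it is in */
        bool     virtual_irqs_enabled;      /* the guest's virtual interrupt flag */
    };

    /* Helpers assumed to exist elsewhere in this made-up hypervisor. */
    uint32_t fetch_guest_instruction(struct vcpu *v, uint64_t pc);
    int      decode(uint32_t insn);
    int      insn_length(uint32_t insn);
    void     inject_program_exception(struct vcpu *v);      /* reflect the trap into the guest */
    void     emulate_io_request(struct vcpu *v, uint32_t insn);

    enum { GUEST_DISABLE_INTERRUPTS, GUEST_START_IO /* , ... */ };

    /* Called when the hardware traps a privileged instruction while a
     * guest is running; the guest's saved state is in vcpu. */
    void handle_privileged_op_trap(struct vcpu *vcpu)
    {
        uint32_t insn = fetch_guest_instruction(vcpu, vcpu->pc);

        if (!vcpu->guest_in_supervisor_state) {
            /* An ordinary user program inside the guest did this: reflect
             * the exception so the guest OS can abend it, just as it
             * would on real hardware. */
            inject_program_exception(vcpu);
            return;
        }

        /* The guest *operating system* did it: emulate the effect without
         * ever giving the guest real privileged mode. */
        switch (decode(insn)) {
        case GUEST_DISABLE_INTERRUPTS:
            vcpu->virtual_irqs_enabled = false;  /* just remember it */
            break;
        case GUEST_START_IO:
            emulate_io_request(vcpu, insn);      /* do the I/O on the guest's behalf */
            break;
        default:
            inject_program_exception(vcpu);      /* anything else we don't handle */
            return;
        }
        vcpu->pc += insn_length(insn);           /* step past the emulated instruction */
    }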
Instead of shutting down the guest operating system, the hypervisor looks at what the guest operating system was trying to do and does something similar. For example, if a guest operating system turns interrupts off, the hypervisor notes this but doesn't actually turn interrupts off. Any interrupts that should be passed to the guest operating system are instead queued up, and only delivered when the guest operating system turns interrupts on again. The basic idea is to "pretend" to behave the way the real hardware would. If the guest operating system reads a block from a magnetic tape, the hypervisor reads it, puts it in memory where it should be, adjusts the control blocks in the right way, and then interrupts the guest operating system.

To be really useful, this basic approach needs to be extended a bit. Otherwise the only way it works is to partition up the hardware so that OS1 gets line printer 1 and OS2 gets line printer 2, which is clearly not very workable since there might only be one line printer. So the first refinement is to make every guest operating system appear to have its own line printer, while the VM hypervisor underneath actually spools everything to be printed. It can thus hide the fact that there is only one line printer, printing each document only when it is completely ready to be printed. Similarly, if a guest operating system asked for a particular disk pack to be loaded (in those days, disk drives had exchangeable disks), the hypervisor would get the operator to load it, and then make it appear as if it were on one of the guest operating system's own disk drives. It would then translate requests to read from that guest operating system's drive into read requests to the actual physical hardware where the disk pack was installed. It's sort of like The Matrix: the guest operating system doesn't realize it is living in a simulation of the real world.

The next step is to let the guest operating systems be aware that they are running on a virtual machine and allow them to communicate with the hypervisor. It turns out that the IBM 370 has one instruction, DIAG, that is never used in either operating systems or user programs; it is only used to diagnose hardware errors in special test programs. So this is the wormhole that is used to get from the guest operating system to the hypervisor. The guest operating system can then issue special requests that wouldn't make sense if it were running on the real hardware, such as suspending the entire guest machine for a period.

VMware doesn't run on IBM mainframes, of course, but rather on Intel (and AMD) x86 hardware. It turns out that Intel's architecture wasn't quite as clean for virtualization as IBM's. In particular, user programs could read (but not write) the CPU's control registers. On the IBM 370, if a guest operating system tried to read a register like that, it would generate an exception and the hypervisor could decide what the guest operating system "ought" to see and deliver it. On an x86, the instruction would just execute and deliver the real contents of the register. So on x86, the code needs to be scanned and rewritten to ensure that the hypervisor always gets control when it should. This is not a complex rewriting, since almost all instructions are simply copied; just a few need to be passed to the hypervisor. The performance penalty is very low, since almost all code is running natively, executed directly by the hardware.
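For illustration, here is a rough C sketch of that scan-and-rewrite step. It is not VMware's actual implementation, the names are invented, and it ignores the real complications (self-modifying code, precise exceptions, and so on); it only shows the idea of copying ordinary instructions and diverting the few sensitive ones to the hypervisor.

    #include <stdbool.h>
    #include <stdint.h>

    struct vcpu;                                 /* made-up per-guest-CPU state */
    struct x86_insn { int length; /* ... */ };   /* made-up decoded instruction */

    /* Helpers assumed to exist elsewhere in this made-up hypervisor. */
    uint8_t *alloc_in_translation_cache(struct vcpu *v);
    struct x86_insn decode_x86(const uint8_t *code);
    bool     is_sensitive(const struct x86_insn *insn);
    uint8_t *emit_call_to_hypervisor(uint8_t *out, const struct x86_insn *insn);
    uint8_t *emit_copy(uint8_t *out, const struct x86_insn *insn);
    uint8_t *emit_jump_to_dispatcher(uint8_t *out);
    bool     ends_basic_block(const struct x86_insn *insn);

    /* Copy one basic block of guest code into the translation cache,
     * replacing the few sensitive instructions with hypervisor calls. */
    uint8_t *translate_block(struct vcpu *vcpu, const uint8_t *guest_code)
    {
        uint8_t *start = alloc_in_translation_cache(vcpu);
        uint8_t *out = start;

        for (;;) {
            struct x86_insn insn = decode_x86(guest_code);

            if (is_sensitive(&insn)) {
                /* e.g. a read of a control register: call a helper that
                 * returns what the guest "ought" to see, not the real value. */
                out = emit_call_to_hypervisor(out, &insn);
            } else {
                out = emit_copy(out, &insn);         /* the common case */
            }

            guest_code += insn.length;
            if (ends_basic_block(&insn)) {           /* branch, call, return... */
                out = emit_jump_to_dispatcher(out);
                return start;                        /* block is ready to run natively */
            }
        }
    }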
This is how VMware runs in big datacenters, but it is also the technique that lets you run Windows on your Mac or vice versa.

On the type of virtual platform used for SoC verification, the rewriting is taken a stage further. The code might be written for an ARM processor, so when the code is rewritten it is not just copied but translated from ARM instructions into x86 instructions. This way the code runs fast, since most of it is executing natively. In fact, since embedded ARM processors are typically much slower than x86 server processors, the code might even run faster than on the real hardware. The same approach is used to execute Java bytecodes, which no hardware can execute directly. This approach is known as just-in-time compilation, or just JIT compilation (and sometimes as dynamic translation).
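The outer loop of such a dynamic translator is conceptually simple. The sketch below, again in C with invented names, assumes an Arm guest and an x86 host purely as an example; a real JIT adds block chaining, cache invalidation and many other refinements.

    #include <stddef.h>
    #include <stdint.h>

    /* Made-up per-guest-CPU state: just the guest program counter here. */
    struct vcpu { uint64_t guest_pc; };

    /* A translated block is host (x86) code we can call directly; it runs
     * to the end of the block and returns the next guest PC. */
    typedef uint64_t (*host_block_fn)(struct vcpu *vcpu);

    /* Helpers assumed to exist elsewhere in this made-up translator. */
    host_block_fn lookup_translation_cache(uint64_t guest_pc);
    host_block_fn translate_arm_block_to_x86(struct vcpu *vcpu, uint64_t guest_pc);
    void          insert_into_translation_cache(uint64_t guest_pc, host_block_fn block);

    /* Main loop: translate guest (Arm) code block by block, cache the
     * result, and execute the translations natively on the host. */
    void run_guest(struct vcpu *vcpu)
    {
        for (;;) {
            host_block_fn block = lookup_translation_cache(vcpu->guest_pc);

            if (block == NULL) {
                /* First time at this address: translate it now ("just in
                 * time") so the cost is paid only once. */
                block = translate_arm_block_to_x86(vcpu, vcpu->guest_pc);
                insert_into_translation_cache(vcpu->guest_pc, block);
            }

            vcpu->guest_pc = block(vcpu);   /* run natively; get next guest PC */
        }
    }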