kvm: hardware assisted paging
CPU vendors began adding hardware support for a virtualized memory management unit (vMMU) circa 2008-2009, as an extension to Intel's existing VT-x (the vmx CPU flag). Historically, guest physical (gpa) to host physical (hpa) address translation was done in software, using shadow page tables. These tables are kept synchronized with the guest's page tables and are one of the main sources of overhead in virtual machines, as keeping them up to date incurs expensive vm exits. A common way of keeping the shadow pages in sync is to write-protect the guest's page tables, so that whenever the guest changes them a page fault is triggered and intercepted by the VMM, which emulates the guest's write and updates the shadow entries accordingly. This, of course, is transparent to the guest.

Another major problem is that traditional TLB semantics require a flush upon every context switch, so that a newly scheduled process starts with an empty TLB and only caches entries belonging to its own address space. To overcome this, CPUs now incorporate tags into the TLB (known as vpid on Intel), which associate each cached translation with its address space and thus reduce the number of flushes.
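Coming back to the shadow-paging case: the write-protect-and-sync idea can be made concrete with a minimal, self-contained sketch. This is plain C, not KVM code, and every name in it (vmm_gpa_to_hpa(), shadow_sync_fault(), guest_write_pte(), ...) is made up for illustration; it only models the control flow described above, with a one-entry table.

/*
 * Conceptual sketch: a VMM keeping a shadow entry in sync by
 * write-protecting the guest's page table.  Not KVM code.
 */
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gpa_t;   /* guest physical address */
typedef uint64_t hpa_t;   /* host physical address  */

struct guest_pte {
    gpa_t target;          /* what the guest thinks it maps to           */
    bool  write_protect;   /* set by the VMM on guest page-table pages   */
};

struct shadow_pte {
    hpa_t target;          /* the gva->hpa mapping the CPU actually uses */
};

/* hypothetical: the VMM's own gpa->hpa mapping for this guest */
static hpa_t vmm_gpa_to_hpa(gpa_t gpa)
{
    return gpa + 0x100000000ULL;   /* pretend guest RAM sits at a 4 GiB host offset */
}

/*
 * Hypothetical fault handler: the guest wrote to a write-protected
 * page-table page.  The VMM emulates the write, so the guest sees the
 * update it expects, and refreshes the shadow entry accordingly.
 */
static void shadow_sync_fault(struct guest_pte *gpte, gpa_t new_target,
                              struct shadow_pte *spte)
{
    gpte->target = new_target;                  /* emulate the guest's write */
    spte->target = vmm_gpa_to_hpa(new_target);  /* keep the shadow in sync   */
}

/* simulate a guest store to one of its own page-table entries */
static void guest_write_pte(struct guest_pte *gpte, gpa_t new_target,
                            struct shadow_pte *spte)
{
    if (gpte->write_protect)
        shadow_sync_fault(gpte, new_target, spte);  /* faults into the VMM */
    else
        gpte->target = new_target;                  /* no VMM involvement  */
}

int main(void)
{
    struct guest_pte  gpte = { .target = 0x1000, .write_protect = true };
    struct shadow_pte spte = { .target = vmm_gpa_to_hpa(0x1000) };

    guest_write_pte(&gpte, 0x5000, &spte);   /* guest remaps the page */

    printf("guest pte -> %#llx, shadow pte -> %#llx\n",
           (unsigned long long)gpte.target,
           (unsigned long long)spte.target);
    return 0;
}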
With hardware vMMUs, in order to avoid the VMM overhead of shadow paging, the guest is left alone to update its own page tables, while a second set of page tables, walked by the hardware, maps gpa to hpa. Intel calls these Extended Page Tables (EPT). Having two sets of page tables now means that when a guest translates an address, both hierarchies must be walked (sometimes referred to as a 2D page walk). So hardware support can come at a greater cost than its software equivalent for programs with poor locality and cache-unfriendly access patterns. When a TLB miss occurs and the guest does a page walk, the entire EPT must be walked for each hierarchical level of the guest's tables (each guest entry holds a gpa), and once more for the final gpa, to obtain the hpa. For 64-bit guests this is worse than for 32-bit ones, as the 64-bit address space requires more levels of translation (PML4, PDPT, PD, PT).
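To put a rough number on that cost: with four guest levels and four EPT levels, each of the four guest page-table references is a gpa that needs its own EPT walk, and the final gpa has to be translated as well, which gives the commonly quoted worst case of 24 memory references for a single TLB miss, versus 4 for a native walk. A tiny back-of-the-envelope sketch of that arithmetic (plain C, nothing KVM-specific):

/* Worst-case page-table references for one TLB miss under nested paging. */
#include <stdio.h>

static int walk_cost_2d(int guest_levels, int ept_levels)
{
    /* every guest level needs a full EPT walk plus its own access,
     * and the final gpa of the data needs one more EPT walk */
    return guest_levels * ept_levels + guest_levels + ept_levels;
}

int main(void)
{
    printf("native 4-level walk : 4 references\n");
    printf("2D walk (4+4 levels): %d references\n", walk_cost_2d(4, 4));
    return 0;
}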
KVM's implementation of EPT is quite unique and uses both the guest's tables and the hardware's to translate addresses. When a guest virtual address needs to be translated into a guest physical one, the gva_to_gpa() function is called:
static gpa_t FNAME(gva_to_gpa)(struct kvm_vcpu *vcpu, gva_t vaddr, u32 access,
                               struct x86_exception *exception)
{
    struct guest_walker walker;
    gpa_t gpa = UNMAPPED_GVA;
    int r;

    /* software walk of the guest's own page tables for vaddr */
    r = FNAME(walk_addr)(&walker, vcpu, vaddr, access);

    if (r) {
        /* guest frame number plus the offset within the page */
        gpa = gfn_to_gpa(walker.gfn);
        gpa |= vaddr & ~PAGE_MASK;
    } else if (exception)
        /* the walk failed: report the fault details to the caller */
        *exception = walker.fault;

    return gpa;
}
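As an aside, the FNAME() macro is what lets KVM build this template once per guest paging mode: arch/x86/kvm/paging_tmpl.h is included with PTTYPE set to 64 or 32, and the macro pastes the mode into each function name, so the routine above ends up existing as paging64_gva_to_gpa() and paging32_gva_to_gpa(). Slightly simplified, the relevant defines look roughly like this:

/* simplified; see arch/x86/kvm/paging_tmpl.h for the full per-mode defines */
#if PTTYPE == 64
#define FNAME(name) paging##64_##name
#define guest_walker guest_walker64
#elif PTTYPE == 32
#define FNAME(name) paging##32_##name
#define guest_walker guest_walker32
#endif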
If the guest's walk fails and the gva to gpa mapping is not present, a page fault is raised, and tdp_page_fault() (two-dimensional paging) is invoked through an EPT violation, handle_ept_violation(), to translate the gpa to an hpa. A new page table entry is created, reusing the shadow page code through mmu_set_spte(), and added to the beginning of the pte list through pte_list_add(). This way, the next time the guest virtual address is accessed, the mapping will already be present in the guest's pages, walk_addr() will succeed, and the gpa can be returned without further ado.
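Putting that path together, the EPT side can be sketched as follows. This is not the kernel code, only its overall shape: the vmm_* helpers are made-up stand-ins for the work that handle_ept_violation(), tdp_page_fault(), mmu_set_spte() and pte_list_add() do in the real vmx.c and mmu.c.

/* Conceptual sketch of the EPT-violation handling flow, not KVM code. */
#include <stdint.h>
#include <stdio.h>

typedef uint64_t gpa_t;
typedef uint64_t hpa_t;

/* hypothetical stand-in: find (or allocate) the host page backing this gpa */
static hpa_t vmm_pin_host_page(gpa_t gpa)
{
    return gpa + 0x100000000ULL;   /* pretend guest RAM sits at a 4 GiB host offset */
}

/* hypothetical stand-in for building the missing gpa->hpa entry; the real
 * code reuses the shadow-paging helper mmu_set_spte() and links the new
 * entry onto its rmap list with pte_list_add() */
static void vmm_install_ept_entry(gpa_t gpa, hpa_t hpa)
{
    printf("EPT: map gpa %#llx -> hpa %#llx\n",
           (unsigned long long)gpa, (unsigned long long)hpa);
}

/* roughly the work handle_ept_violation()/tdp_page_fault() perform */
static void on_ept_violation(gpa_t faulting_gpa)
{
    hpa_t hpa = vmm_pin_host_page(faulting_gpa);

    vmm_install_ept_entry(faulting_gpa, hpa);
    /* resume the guest: the retried access now hits in the EPT */
}

int main(void)
{
    on_ept_violation(0x2000);
    return 0;
}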
Source: https://blog.stgolabs.net/2012/03/kvm-hardware-assisted-paging.html