Deep dive into virtio-iommu

Michael Zhao
Apr 28, 2022


This article is a collection of questions on how virtio-iommu works and the answers I figured out.

To understand how virtio-iommu works, I tested with Cloud Hypervisor and QEMU on an AArch64 server.

Ask: How is a virtio device connected to virtio-iommu?

To be precise, when the question first hit my mind, it was: if I have several virtio devices (assuming they are all virtio-pci devices), I want to hide some of them behind the virtio-iommu and leave the others out. How do I tell the virtio-iommu to translate for some devices and not for others?

For a virtual machine monitor (VMM) on the arm64 architecture, there are two ways to tell the guest kernel which devices should be hidden behind the virtio-iommu.

Using FDT

If your VM boots a kernel directly and is described by an FDT, you need to add the following to the FDT:

  • iommu node: The node should describe the essential information of the virtio-iommu device:
    - reg : the encoded BDF of the device
    - compatible : virtio,pci-iommu in our case, so the kernel knows which driver to apply
  • iommu-map property in the pci node: An array. Each entry describes a mapping relation: an iommu device and the range of PCI devices it serves.

A kernel document explains how an IOMMU works with PCI in FDT.

For ease of reading, here is a screenshot of an example from the kernel document. The comments on lines 161~167 explain more about how iommu-map works.
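In case the screenshot is hard to read, here is a small Rust sketch of how a VMM could compute the two values involved: the virtio-iommu's reg address (its BDF packed into the PCI phys.hi cell) and one iommu-map entry, which is a 4-cell tuple of (rid-base, iommu phandle, iommu-base, length). The helper names and the example BDF are mine, just for illustration.

```rust
/// Encode a PCI BDF as a Requester ID (the "rid" used in iommu-map):
/// bits [15:8] bus, [7:3] device, [2:0] function.
fn pci_rid(bus: u8, dev: u8, func: u8) -> u16 {
    ((bus as u16) << 8) | ((dev as u16 & 0x1f) << 3) | (func as u16 & 0x7)
}

/// Encode the high cell of a PCI `reg` address (phys.hi):
/// bits [23:16] bus, [15:11] device, [10:8] function.
fn pci_reg_hi(bus: u8, dev: u8, func: u8) -> u32 {
    ((bus as u32) << 16) | ((dev as u32 & 0x1f) << 11) | ((func as u32 & 0x7) << 8)
}

fn main() {
    // Assume the virtio-iommu is 00:01.0 (made up for the example).
    let iommu_phandle: u32 = 1;
    let iommu_rid = pci_rid(0, 1, 0);

    // iommu node: reg = <phys.hi 0 0 0 0>;
    println!("iommu reg (phys.hi) = {:#x}", pci_reg_hi(0, 1, 0));

    // One iommu-map entry: <rid-base iommu-phandle iommu-base length>,
    // here routing the whole 16-bit RID space through the iommu and
    // keeping the RID as the endpoint ID.
    let entry = [0u32, iommu_phandle, 0u32, 0x1_0000u32];
    println!("iommu-map entry = {:x?} (iommu itself has rid {:#x})", entry, iommu_rid);
}
```

A real VMM may also need to split the range so that the virtio-iommu's own RID is not routed through itself.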

Now you know that FDT doesn’t describe a one-to-one relationship between a virtio-pci device and the virtio-iommu: an iommu device serves a range of PCI devices. From the information in the FDT alone, the kernel cannot tell which PCI devices should be hidden behind the iommu and which should not. In the log of the VMM, you can see the kernel send ATTACH requests to the iommu for all the PCI devices in the range.

But that doesn’t mean all of these PCI devices have to use the iommu, because no MAP request is sent to the iommu for the devices you didn’t connect to it. I suppose this (mapping or not) is decided by the Linux kernel somehow, but I haven’t figured out how it is done.

Using ACPI

If your VM boots from UEFI and uses ACPI, the connection between the virtio-iommu and the PCI devices it serves is described by the Virtual I/O Translation Table (VIOT).

Put simply, in the VIOT table you specify the device ID (BDF) of the virtio-iommu and the IDs of all the PCI devices it serves.
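Conceptually, the table contains one node describing the virtio-iommu itself (identified by PCI segment and BDF) and one or more PCI range nodes whose output points back to that IOMMU node. The Rust structs below are a simplified sketch of this relationship, not the exact binary layout; see the VIOT specification for the real field offsets.

```rust
/// Simplified view of a VIOT "virtio-iommu based on virtio-pci" node:
/// it identifies the IOMMU by its PCI segment and BDF.
struct ViotVirtioPciIommu {
    segment: u16,
    bdf: u16,
}

/// Simplified view of a VIOT "PCI range" node: a range of endpoint BDFs
/// translated by the IOMMU node referenced by `output_node`.
struct ViotPciRange {
    endpoint_start: u32, // first endpoint ID assigned to this range
    segment_start: u16,
    segment_end: u16,
    bdf_start: u16,
    bdf_end: u16,
    output_node: u16, // offset of the IOMMU node inside the table
}

fn main() {
    // Made-up example: the virtio-iommu is 00:01.0, and every device on
    // segment 0, bus 0 (BDFs 0x0000..=0x00ff) is translated by it.
    let iommu = ViotVirtioPciIommu { segment: 0, bdf: 0x0008 }; // 00:01.0
    let range = ViotPciRange {
        endpoint_start: 0,
        segment_start: 0,
        segment_end: 0,
        bdf_start: 0x0000,
        bdf_end: 0x00ff,
        output_node: 48, // hypothetical offset of the IOMMU node
    };
    println!("iommu at BDF {:#06x} (node offset {}), serves BDFs {:#06x}..={:#06x}",
             iommu.bdf, range.output_node, range.bdf_start, range.bdf_end);
}
```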

Ask: When is a virtio-pci device attached to the virtio-iommu?

Based on the previous discussion of FDT and ACPI, the answer to this question is easy: the kernel already knows that a virtio-pci device needs to sit behind a virtio-iommu device. So in the probe step of the virtio-pci driver, the kernel sends MAP requests to the virtio-iommu device in the VMM.
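For reference, the requests the guest driver places on the virtio-iommu request queue look roughly like this. The Rust structs below paraphrase the virtio-iommu specification (head and tail flattened into each struct for brevity); check the spec for the authoritative definitions.

```rust
// Request types (from the virtio-iommu specification).
const VIRTIO_IOMMU_T_ATTACH: u8 = 0x01;
const VIRTIO_IOMMU_T_MAP: u8 = 0x03;

// Mapping permission flags.
const VIRTIO_IOMMU_MAP_F_READ: u32 = 1;
const VIRTIO_IOMMU_MAP_F_WRITE: u32 = 2;

/// ATTACH: bind an endpoint (the PCI device, identified by its endpoint ID)
/// to a translation domain. All fields are little-endian on the wire.
#[repr(C, packed)]
struct VirtioIommuReqAttach {
    req_type: u8, // VIRTIO_IOMMU_T_ATTACH
    reserved_head: [u8; 3],
    domain: u32,
    endpoint: u32,
    reserved: [u8; 8], // flags/reserved depending on the spec revision
    status: u8,
    reserved_tail: [u8; 3],
}

/// MAP: create a mapping from an IOVA range to a guest-physical range
/// inside a domain. This is what gets sent when the driver maps DMA memory.
#[repr(C, packed)]
struct VirtioIommuReqMap {
    req_type: u8, // VIRTIO_IOMMU_T_MAP
    reserved_head: [u8; 3],
    domain: u32,
    virt_start: u64,
    virt_end: u64,
    phys_start: u64,
    flags: u32, // READ/WRITE/...
    status: u8,
    reserved_tail: [u8; 3],
}

fn main() {
    // Print the wire sizes as a quick sanity check.
    println!("attach: {} bytes, map: {} bytes",
             std::mem::size_of::<VirtioIommuReqAttach>(),
             std::mem::size_of::<VirtioIommuReqMap>());
    let _ = (VIRTIO_IOMMU_T_ATTACH, VIRTIO_IOMMU_T_MAP,
             VIRTIO_IOMMU_MAP_F_READ, VIRTIO_IOMMU_MAP_F_WRITE);
}
```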

Ask: How is the IOVA determined in a mapping request?

IOVA stands for I/O virtual address. After the mapping between a virtual address and a physical address is established, the driver can use this virtual address to perform DMA.

The physical address is allocated from the guest IO space via the VMM. But how is the virtual address allocated?

The kernel function iommu_dma_alloc_iova() is a good starting point if you want to look into the details. This function takes the requested size (and other parameters) and returns the allocated IOVA.

In my test, the requested size was often 4KiB, and the IOVA was allocated downward from the top of the 4GiB address space. So the first allocated IOVA was 0xFFFFF000, the second in the same domain was 0xFFFFE000, and so on.
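The kernel's real allocator is more involved than this, but a toy Rust allocator that hands out page-aligned IOVAs downward from the 4 GiB limit reproduces the values I observed:

```rust
const PAGE_SIZE: u64 = 0x1000; // 4 KiB
const IOVA_LIMIT: u64 = 1 << 32; // allocate below 4 GiB

/// Toy top-down IOVA allocator: each allocation is carved out just
/// below the previous one. No freeing, no reuse, just the happy path.
struct IovaAllocator {
    next_top: u64,
}

impl IovaAllocator {
    fn new() -> Self {
        Self { next_top: IOVA_LIMIT }
    }

    fn alloc(&mut self, size: u64) -> u64 {
        // Round the size up to a whole number of pages.
        let size = (size + PAGE_SIZE - 1) & !(PAGE_SIZE - 1);
        self.next_top -= size;
        self.next_top
    }
}

fn main() {
    let mut alloc = IovaAllocator::new();
    // Two 4 KiB allocations in the same domain.
    println!("{:#x}", alloc.alloc(0x1000)); // 0xfffff000
    println!("{:#x}", alloc.alloc(0x1000)); // 0xffffe000
}
```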

Ask: How is the address translation performed?

With an MMU, hardware (the page-table walker and TLB) translates virtual addresses into physical addresses. For virtio-iommu, there is obviously no equivalent hardware doing the job, so the translation between virtual and physical addresses must be done by VMM software.

I will explain this in more detail with Cloud Hypervisor.

In Cloud Hypervisor the translation is done by AccessPlatform, a trait providing functions to translate between virtual and physical addresses.

virtio-iommu is connected to virtio devices via the AccessPlatform. Specifically, things work this way:

  • Before creating any virtio device, the device manager creates the virtio-iommu device first. (code reference)
  • While creating a virtio device that is attached to the virtio-iommu, the device manager creates an AccessPlatform instance from the mapping data structure of the virtio-iommu. At this point the mapping data is still empty. (code reference)
  • After a virtio device is initialized, the mapping data is updated each time a MAP or UNMAP request arrives at the virtio-iommu.
  • When the virtio device tries to access some IO address, it checks whether it has an AccessPlatform instance. If it does, the device itself is hidden behind a virtio-iommu, so it calls the translate_gva function of the AccessPlatform to get the mapped physical address. With that physical address, the device then finishes the IO access. (A simplified sketch of this flow follows the list.)
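Here is the promised sketch of that flow. It is my own mock-up, not Cloud Hypervisor's actual types: the virtio-iommu and the virtio device share a mapping table, MAP/UNMAP requests update it, and the device's translate call walks it.

```rust
use std::collections::BTreeMap;
use std::sync::{Arc, Mutex};

/// IOVA -> (guest-physical address, size) mappings for one endpoint,
/// shared between the virtio-iommu device and the virtio device it hides.
type Mappings = Arc<Mutex<BTreeMap<u64, (u64, u64)>>>;

/// Stand-in for Cloud Hypervisor's AccessPlatform trait.
trait AccessPlatform {
    fn translate_gva(&self, iova: u64, size: u64) -> Option<u64>;
}

struct IommuMapping {
    mappings: Mappings,
}

impl AccessPlatform for IommuMapping {
    fn translate_gva(&self, iova: u64, size: u64) -> Option<u64> {
        let mappings = self.mappings.lock().unwrap();
        // Find the mapping that starts at or below `iova` and check that
        // it covers the whole access.
        let (&virt_start, &(phys_start, len)) = mappings.range(..=iova).next_back()?;
        if iova + size <= virt_start + len {
            Some(phys_start + (iova - virt_start))
        } else {
            None
        }
    }
}

fn main() {
    let mappings: Mappings = Arc::new(Mutex::new(BTreeMap::new()));

    // The virtio-iommu side: a MAP request inserts a mapping.
    mappings.lock().unwrap().insert(0xffff_f000, (0x4000_0000, 0x1000));

    // The virtio device side: translate before touching guest memory.
    let platform = IommuMapping { mappings: mappings.clone() };
    println!("{:?}", platform.translate_gva(0xffff_f080, 0x40)); // Some(0x40000080)
}
```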

Something to note regarding physical addresses: the “physical address” mentioned in the process above is not the real physical address on the host; it is the physical address in the guest. A virtio-iommu device only performs the translation (between virtual and physical addresses) inside the virtual system it lives in; it cannot see the physical addresses of the host machine (bare metal). That is why the functions of the AccessPlatform trait are named with gva (guest virtual address) and gpa (guest physical address).
