view tools/blktap2/README @ 22848:6341fe0f4e5a

Blktap2 Userspace Tools + Library
=================================

Dutch Meyer
4th June 2009

Andrew Warfield and Julian Chesterfield
16th June 2006


The blktap2 userspace toolkit provides a user-level disk I/O
interface. The blktap2 mechanism involves a kernel driver that acts
similarly to the existing Xen/Linux blkback driver, and a set of
associated user-level libraries. Using these tools, blktap2 allows
virtual block devices presented to VMs to be implemented in userspace
and to be backed by raw partitions, files, network, etc.

The key benefit of blktap2 is that it makes it easy and fast to write
arbitrary block backends, and that these user-level backends actually
perform very well. Specifically:

- Metadata disk formats such as Copy-on-Write, encrypted disks, sparse
  formats and other compression features can be easily implemented.

- Accessing file-based images from userspace avoids problems related
  to flushing dirty pages which are present in the Linux loopback
  driver. (Specifically, doing a large number of writes to an
  NFS-backed image doesn't result in the OOM killer going berserk.)

- Per-disk handler processes enable easier userspace policing of block
  resources, and process-granularity QoS techniques (disk scheduling
  and related tools) may be trivially applied to block devices.

- It's very easy to take advantage of userspace facilities such as
  networking libraries, compression utilities, peer-to-peer
  file-sharing systems and so on to build more complex block backends.

- Crashes are contained -- incremental development/debugging is very
  fast.

How it works (in one paragraph):

Working in conjunction with the kernel blktap2 driver, all disk I/O
requests from VMs are passed to the userspace daemon (using a shared
memory interface) through a character device. Each active disk is
mapped to an individual device node, allowing per-disk processes to
implement individual block devices where desired. The userspace
drivers are implemented using asynchronous (Linux libaio),
O_DIRECT-based calls to preserve the unbuffered, batched and
asynchronous request dispatch achieved with the existing blkback
code. We provide a simple, asynchronous virtual disk interface that
makes it quite easy to add new disk implementations.

As of June 2009 the currently supported disk formats are:

- Raw Images (both on partitions and in image files)
- Fast sharable RAM disk between VMs (requires some form of
  cluster-based filesystem support e.g. OCFS2 in the guest kernel)
- VHD, including snapshots and sparse images
- Qcow, including snapshots and sparse images


Build and Installation Instructions
===================================

Make sure to configure the blktap2 backend driver in your dom0 kernel.
It will inter-operate with the existing backend and frontend drivers.
It will also coexist with the original blktap driver. However, some
formats (currently aio and qcow) will default to their blktap2
versions when specified in a VM configuration file.

To build the tools separately, "make && make install" in
tools/blktap2.


Using the Tools
===============

Preparing an image for boot:

The userspace disk agent is configured to start automatically via xend.

Customize the VM config file to use the 'tap:tapdisk' handler,
followed by the driver type. e.g. for a raw image such as a file or
partition:

    disk = ['tap:tapdisk:aio:<FILENAME>,sda1,w']

Alternatively, the vhd-util tool (installed with make install, or in
/blktap2/vhd) can be used to build sparse copy-on-write vhd images.

For example, to build a sparse image -
    vhd-util create -n MyVHDFile -s 1024

This creates a sparse 1GB file named "MyVHDFile" that can be mounted
and populated with data.

One can also base the image on a raw file -
    vhd-util snapshot -n MyVHDFile -p SomeRawFile -m

This creates a sparse VHD file named "MyVHDFile" using "SomeRawFile"
as a parent image. Copy-on-write semantics ensure that writes will be
stored in "MyVHDFile" while reads will be directed to the most
recently written version of the data, either in "MyVHDFile" or
"SomeRawFile" as is appropriate. Other options exist as well; consult
the vhd-util application for the complete set of VHD tools.
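
The copy-on-write behaviour described above can be modelled in a few
lines of C. This is only a conceptual sketch: the cow_disk structure,
the per-sector "written" map and the four-sector disk are invented for
illustration and do not reflect the VHD on-disk format or the vhd-util
implementation.

```c
#include <assert.h>
#include <string.h>

#define SECTOR_SIZE 512
#define NSECTORS    4      /* tiny disk, for illustration only */

/* Hypothetical in-memory model of a child image layered on a parent. */
struct cow_disk {
    const char   *parent;                        /* read-only base image */
    char          child[NSECTORS * SECTOR_SIZE]; /* sectors written since the snapshot */
    unsigned char written[NSECTORS];             /* 1 if the child holds the sector */
};

/* Writes always land in the child; the parent is never modified. */
static void cow_write(struct cow_disk *d, int sec, const char *src)
{
    assert(sec >= 0 && sec < NSECTORS);
    memcpy(d->child + sec * SECTOR_SIZE, src, SECTOR_SIZE);
    d->written[sec] = 1;
}

/* Reads return the child's copy if one exists, otherwise the parent's. */
static void cow_read(const struct cow_disk *d, int sec, char *dst)
{
    assert(sec >= 0 && sec < NSECTORS);
    const char *src = d->written[sec] ? d->child : d->parent;
    memcpy(dst, src + sec * SECTOR_SIZE, SECTOR_SIZE);
}
```

In this model "SomeRawFile" plays the role of parent and "MyVHDFile"
the role of child: a sector is served from the parent until the first
write to it, and from the child afterwards.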

VHD files can be mounted automatically in a guest similarly to the
above AIO example simply by specifying the vhd driver.

    disk = ['tap:tapdisk:vhd:<VHD FILENAME>,sda1,w']


Snapshots:

Pausing a guest will also plug the corresponding IO queue for blktap2
devices and stop blktap2 drivers. This can be used to implement a
safe live snapshot of qcow and vhd disks. An example script "xmsnap"
is shown in the tools/blktap2/drivers directory. This script will
perform a live snapshot of a qcow disk. VHD files can use the
"vhd-util snapshot" tool discussed above. If this snapshot command is
applied to a raw file mounted with tap:tapdisk:aio, include the -m
flag and the driver will be reloaded as VHD. If applied to an already
mounted VHD file, omit the -m flag.


Mounting images in Dom0 using the blktap2 driver
================================================
Tap (and blkback) disks are also mountable in Dom0 without requiring an
active VM to attach.

The syntax is -
    tapdisk2 -n <type>:<full path to file>

For example -
    tapdisk2 -n aio:/home/images/rawFile.img

When successful the location of the new device will be provided by
tapdisk2 to stdout and tapdisk2 will terminate. From that point
forward control of the device is provided through sysfs in the
directory -

    /sys/class/blktap2/blktap#/

Where # is a blktap2 device number present in the path that tapdisk2
printed before terminating. The sysfs interface is largely intuitive;
for example, to remove tap device 0 one would -

    echo 1 > /sys/class/blktap2/blktap0/remove

Similarly, a pause control is available, which can be used to plug
the request queue of a live running guest.
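
The same sysfs write can be issued from a program instead of the
shell. The helpers below are a minimal sketch; their names
(blktap2_sysfs_path, write_one, blktap2_remove) are made up for this
example and are not part of the blktap2 tools. Only the "remove"
attribute shown above is used.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Build "/sys/class/blktap2/blktap<minor>/<attr>" in buf.
 * Returns 0 on success, -1 if the arguments are invalid or buf is too small. */
static int blktap2_sysfs_path(char *buf, size_t len, int minor, const char *attr)
{
    assert(buf != NULL && attr != NULL);
    if (minor < 0 || attr[0] == '\0' || strchr(attr, '/') != NULL)
        return -1;
    int n = snprintf(buf, len, "/sys/class/blktap2/blktap%d/%s", minor, attr);
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}

/* Equivalent of `echo 1 > path`. Returns 0 on success, -1 on failure. */
static int write_one(const char *path)
{
    FILE *f = fopen(path, "w");
    if (f == NULL)
        return -1;
    int ok = (fputs("1\n", f) >= 0);
    if (fclose(f) != 0)
        ok = 0;
    return ok ? 0 : -1;
}

/* Remove tap device <minor>, as in the shell example above. */
static int blktap2_remove(int minor)
{
    char path[64];
    if (blktap2_sysfs_path(path, sizeof(path), minor, "remove") != 0)
        return -1;
    return write_one(path);
}
```

Calling blktap2_remove(0) would then correspond to the echo command
above; like that command, it needs sufficient privileges in Dom0.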

Previous versions of blktap mounted devices in dom0 by using blkfront
in dom0 and the xm block-attach command. This approach is still
available, though slightly more cumbersome.


Tapdisk Development
===================

People regularly ask how to develop their own tapdisk drivers, and
while it has not yet been well documented, the process is relatively
easy. Here I will provide a brief overview. The best reference, of
course, comes from the existing drivers. Specifically,
blktap2/drivers/block-ram.c and blktap2/drivers/block-aio.c provide
the clearest examples of simple drivers.


Setup:

First you need to register your new driver with blktap. This is done
in disktypes.h. There are five things that you must do. To
demonstrate, I will create a disk called "mynewdisk"; you can name
yours freely.

1) Forward declare an instance of struct tap_disk.

e.g. -
    extern struct tap_disk tapdisk_mynewdisk;

2) Claim one of the unused disk type numbers. Take care to observe the
MAX_DISK_TYPES macro, increasing the number if necessary.

e.g. -
    #define DISK_TYPE_MYNEWDISK 10

3) Create an instance of disk_info_t. The bulk of this file contains
examples of these.

e.g. -
    static disk_info_t mynewdisk_disk = {
        DISK_TYPE_MYNEWDISK,
        "My New Disk (mynewdisk)",
        "mynewdisk",
        0,
    #ifdef TAPDISK
        &tapdisk_mynewdisk,
    #endif
    };

A few words about what these mean. The first field must be the disk
type number you claimed in step (2). The second field is a string
describing your disk, and may contain any relevant info. The third
field is the name of your disk as will be used by the tapdisk2 utility
and xend (for example tapdisk2 -n mynewdisk:/path/to/disk.image, or in
your xm create config file). The fourth is binary and determines
whether you will have one instance of your driver, or many. Here, a 1
means that your driver is a singleton and will coordinate access to
any number of tap devices. 0 is more common, meaning that you will
have one driver for each device that is created. The final field
should contain a reference to the struct tap_disk you created in step
(1).

4) Add a reference to your disk info structure (from step (3)) to the
dtypes array. Take care here - you need to place it in the position
corresponding to the device type number you claimed in step (2). So
we would place &mynewdisk_disk in dtypes[10]. Look at the other
devices in this array and pad with "&null_disk," as necessary.

5) Modify the xend python scripts. You need to add your disk name to
the list of disks that xend recognizes.

edit:
    tools/python/xen/xend/server/

And add your disk to the "blktap_disk_types" array near the top of
the file. Use the same name you specified in the third field of step
(3). The order of this list is not important.


Now your driver is ready to be written. Create a block-mynewdisk.c in
tools/blktap2/drivers and add it to the Makefile.
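
Steps (1) through (4) can be put together in one self-contained file
to see how they interact. This is only a sketch: struct tap_disk,
disk_info_t and null_disk below are simplified stand-ins for the
definitions in disktypes.h, the MAX_DISK_TYPES value is assumed, and
dtypes_consistent() is a hypothetical helper that does not exist in
blktap2.

```c
#include <assert.h>
#include <stddef.h>

#define TAPDISK                       /* assume the tapdisk side of the build */
#define MAX_DISK_TYPES      20        /* assumed value; see disktypes.h */
#define DISK_TYPE_MYNEWDISK 10        /* step (2) */

/* Step (2): the claimed number must fit below MAX_DISK_TYPES. */
_Static_assert(DISK_TYPE_MYNEWDISK < MAX_DISK_TYPES, "raise MAX_DISK_TYPES");

struct tap_disk { const char *disk_type; };   /* stand-in, not the real struct */

typedef struct disk_info {                    /* stand-in for disk_info_t */
    int              idnum;           /* disk type number */
    const char      *name;            /* free-form description */
    const char      *handle;          /* name used by tapdisk2 and xend */
    int              single_handler;  /* 1 = one driver for all devices */
#ifdef TAPDISK
    struct tap_disk *drv;
#endif
} disk_info_t;

/* Step (1): forward declaration. The definition normally lives in
 * block-mynewdisk.c; it is placed here to keep the sketch in one file. */
extern struct tap_disk tapdisk_mynewdisk;
struct tap_disk tapdisk_mynewdisk = { "tapdisk_mynewdisk" };

static disk_info_t null_disk = { -1, "null disk", "null", 0 };

/* Step (3): describe the new disk. */
static disk_info_t mynewdisk_disk = {
    DISK_TYPE_MYNEWDISK,
    "My New Disk (mynewdisk)",
    "mynewdisk",
    0,
#ifdef TAPDISK
    &tapdisk_mynewdisk,
#endif
};

/* Step (4): slot N must hold the disk whose type number is N. Slots 0-9
 * are padded with &null_disk here; the real array lists existing drivers. */
static disk_info_t *dtypes[] = {
    &null_disk, &null_disk, &null_disk, &null_disk, &null_disk,
    &null_disk, &null_disk, &null_disk, &null_disk, &null_disk,
    &mynewdisk_disk,                  /* dtypes[10] */
};

/* Returns 1 if every registered disk sits at the index equal to its idnum. */
static int dtypes_consistent(void)
{
    size_t n = sizeof(dtypes) / sizeof(dtypes[0]);

    assert(n <= MAX_DISK_TYPES);
    for (size_t i = 0; i < n; i++)
        if (dtypes[i]->idnum != -1 && dtypes[i]->idnum != (int)i)
            return 0;
    return 1;
}
```

A check like dtypes_consistent() is one way to catch the mistake step
(4) warns about, where an entry ends up in the wrong slot.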


Development:

Copying block-aio.c and block-ram.c would be a good place to start.
Read those files as you go through this; I will be assisting by
commenting on a few useful functions and structures.

struct tap_disk:

Remember the forward declaration in step (1) of the setup phase above?
Now is the time to make that structure a reality. This structure
contains a list of function pointers for all the routines that will be
asked of your driver. Currently the required functions are open,
close, read, write, get_parent_id, validate_parent, and debug.

e.g. -
    struct tap_disk tapdisk_mynewdisk = {
        .disk_type = "tapdisk_mynewdisk",
        .flags = 0,
        .private_data_size = sizeof(struct tdmynewdisk_state),
        .td_open = tdmynewdisk_open,
        ....

The private_data_size field is used to provide a structure to store
the state of your device. It is very likely that you will want
something here, but you are free to design whatever structure you
want. Blktap will allocate this space for you; you just need to tell
it how much space you want.
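
To make the role of private_data_size concrete, here is a small mock
of both sides of the contract. All types are simplified stand-ins
(assumptions) for those in tapdisk.h, the fields of struct
tdmynewdisk_state are invented, and mock_attach() only imitates what
the framework is described as doing; it is not blktap code.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified stand-ins (assumptions) for the types in tapdisk.h. */
typedef struct td_driver {
    void *data;                        /* private state, allocated by blktap */
} td_driver_t;

struct tap_disk {
    const char *disk_type;
    int         flags;
    int         private_data_size;
    int       (*td_open)(td_driver_t *, const char *, int);
    int       (*td_close)(td_driver_t *);
};

/* Hypothetical per-device state for mynewdisk. */
struct tdmynewdisk_state {
    int  fd;                           /* backing file, -1 when closed */
    long requests;                     /* requests served so far */
};

static int tdmynewdisk_open(td_driver_t *driver, const char *name, int flags)
{
    struct tdmynewdisk_state *s = driver->data;  /* already allocated for us */

    (void)name;
    (void)flags;
    assert(s != NULL);
    s->fd = -1;
    s->requests = 0;
    return 0;
}

static int tdmynewdisk_close(td_driver_t *driver)
{
    (void)driver;
    return 0;
}

struct tap_disk tapdisk_mynewdisk = {
    .disk_type         = "tapdisk_mynewdisk",
    .flags             = 0,
    .private_data_size = sizeof(struct tdmynewdisk_state),
    .td_open           = tdmynewdisk_open,
    .td_close          = tdmynewdisk_close,
};

/* Mock of the framework side: allocate the private area, then call open. */
static int mock_attach(td_driver_t *driver, const struct tap_disk *ops,
                       const char *name)
{
    driver->data = calloc(1, (size_t)ops->private_data_size);
    if (driver->data == NULL)
        return -1;
    return ops->td_open(driver, name, 0);
}
```

The point of the design is that the driver never allocates its own
state: it declares a size once and receives a block of that size
through driver->data.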


tdmynewdisk_open:

This is the open routine. The first argument is a structure
representing your driver. Two fields in this structure are
interesting.

driver->data will contain a block of memory of the size you requested
in the .private_data_size field of your struct tap_disk (above).

driver->info contains a structure that details information about your
disk. You need to fill this out. By convention this is done with a
_get_image_info() function. Assign a size (the total number of
sectors) and a sector_size (the size of each sector in bytes), and set
driver->info->info to 0.

The second parameter contains the name that was specified in the
creation of your device, either through xend, or on the command line
with tapdisk2. Usually this specifies a file that you will open in
this routine. The final parameter, flags, contains one of a number of
flags specified in tapdisk.h that may change the way you treat the
disk.
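
For a disk backed by a regular file, the work described above might
look roughly like the following. This is a sketch under stated
assumptions: struct mock_disk_info is a stand-in for the real info
structure, the 512-byte sector size is assumed, and
tdmynewdisk_open_image() is a hypothetical helper rather than the
actual open signature.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stdint.h>
#include <sys/stat.h>
#include <unistd.h>

#define MOCK_SECTOR_SIZE 512          /* assumed sector size */

/* Stand-in (assumption) for the disk info structure. */
struct mock_disk_info {
    uint64_t size;                    /* total number of sectors */
    long     sector_size;             /* bytes per sector */
    uint32_t info;                    /* set to 0 */
};

/* Hypothetical _get_image_info() for a file-backed disk. Regular files
 * only: a block device does not report its size through st_size. */
static int tdmynewdisk_get_image_info(int fd, struct mock_disk_info *info)
{
    struct stat st;

    assert(info != NULL);
    if (fstat(fd, &st) != 0)
        return -errno;

    info->sector_size = MOCK_SECTOR_SIZE;
    info->size        = (uint64_t)st.st_size / MOCK_SECTOR_SIZE; /* bytes to sectors */
    info->info        = 0;
    return 0;
}

/* Open the file named by `name` and describe it. On success the file
 * descriptor is stored in *fd_out and 0 is returned; otherwise -errno. */
static int tdmynewdisk_open_image(const char *name, struct mock_disk_info *info,
                                  int *fd_out)
{
    int fd, err;

    fd = open(name, O_RDWR);
    if (fd < 0)
        return -errno;

    err = tdmynewdisk_get_image_info(fd, info);
    if (err != 0) {
        close(fd);
        return err;
    }
    *fd_out = fd;
    return 0;
}
```

In a real driver the descriptor would be kept in the private state
reached through driver->data so that the read, write and close
routines can find it.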


_queue_read/write:

These are your read and write operations. What you do here will
depend on your disk, but you should do exactly one of -

1) Call td_complete_request with either an error or a success code.

2) Call td_forward_request, which will forward the request to the next
driver in the stack.

3) Queue the request for asynchronous processing with
td_prep_read/write. In doing so, you will also register a callback
for request completion. When the request completes you must do one of
options (1) or (2) above. Finally, call td_queue_tiocb to submit the
request to a wait queue.

The above functions are defined in tapdisk-interface.c. If you don't
use them as specified you will run into problems as your driver will
fail to inform blktap of the state of requests that have been
submitted. Blktap keeps track of all requests and does not like losing
track.
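
The "exactly one" rule is easiest to see with option (1), a driver
that completes every request synchronously, in the spirit of a RAM
disk. The sketch below is self-contained: mock_request_t and
mock_complete_request() are stand-ins (assumptions) for the real
request type and td_complete_request, and the eight-sector disk is
invented for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <string.h>

#define SECTOR_SIZE  512
#define DISK_SECTORS 8                /* tiny RAM disk, for illustration */

/* Stand-in (assumption) for the request type passed to the driver. */
typedef struct mock_request {
    char    *buf;                     /* data buffer */
    uint64_t sec;                     /* first sector */
    int      secs;                    /* number of sectors */
} mock_request_t;

static int completions;               /* number of completed requests */
static int last_status;               /* status of the latest completion */

/* Stand-in for td_complete_request(): just records the outcome. */
static void mock_complete_request(mock_request_t req, int status)
{
    (void)req;
    completions++;
    last_status = status;
}

static char ramdisk[DISK_SECTORS * SECTOR_SIZE];

/* Returns 1 if the request lies entirely inside the disk. */
static int request_in_range(mock_request_t req)
{
    return req.secs > 0 && req.sec < DISK_SECTORS &&
           (uint64_t)req.secs <= DISK_SECTORS - req.sec;
}

/* Every path through these routines ends in exactly one completion call. */
static void tdmynewdisk_queue_read(mock_request_t req)
{
    assert(req.buf != NULL);
    if (!request_in_range(req)) {
        mock_complete_request(req, -EINVAL);
        return;
    }
    memcpy(req.buf, ramdisk + req.sec * SECTOR_SIZE,
           (size_t)req.secs * SECTOR_SIZE);
    mock_complete_request(req, 0);
}

static void tdmynewdisk_queue_write(mock_request_t req)
{
    assert(req.buf != NULL);
    if (!request_in_range(req)) {
        mock_complete_request(req, -EINVAL);
        return;
    }
    memcpy(ramdisk + req.sec * SECTOR_SIZE, req.buf,
           (size_t)req.secs * SECTOR_SIZE);
    mock_complete_request(req, 0);
}
```

Note the early return after the error completion: falling through to
the success path would complete the same request twice, which is the
kind of bookkeeping error the paragraph above warns about.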


_close, _get_parent_id, _validate_parent:

These last few tend to be very routine. _close is called when the
device is closed, and also when it is paused (in this case, open will
also be called later). The other functions are used in stacking
drivers. Most often drivers will return TD_NO_PARENT and -EINVAL,
respectively.
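
For a driver with no parent image these routines can be as short as
the following sketch. The types are simplified stand-ins
(assumptions), the TD_NO_PARENT value is assumed and should be taken
from tapdisk.h, and the fd field of the private state is
hypothetical.

```c
#include <assert.h>
#include <errno.h>
#include <unistd.h>

#define TD_NO_PARENT 1                /* assumed value; see tapdisk.h */

/* Simplified stand-ins (assumptions) for the blktap types. */
typedef struct td_driver  { void *data; } td_driver_t;
typedef struct td_disk_id { const char *name; } td_disk_id_t;

struct tdmynewdisk_state { int fd; }; /* hypothetical private state */

/* Called on close and on pause, so leave the state ready for a later open. */
static int tdmynewdisk_close(td_driver_t *driver)
{
    struct tdmynewdisk_state *s = driver->data;

    assert(s != NULL);
    if (s->fd >= 0)
        close(s->fd);
    s->fd = -1;
    return 0;
}

/* This disk is not layered on another image. */
static int tdmynewdisk_get_parent_id(td_driver_t *driver, td_disk_id_t *id)
{
    (void)driver;
    (void)id;
    return TD_NO_PARENT;
}

/* With no parent, there is nothing to validate. */
static int tdmynewdisk_validate_parent(td_driver_t *driver,
                                       td_driver_t *parent, int flags)
{
    (void)driver;
    (void)parent;
    (void)flags;
    return -EINVAL;
}
```

Resetting fd to -1 in the close routine keeps a second close harmless,
which matters because a pause is followed by another open.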