.SH
Block Devices
.PP
The block device I/O system is like a
protocol stack of filters.
There is a set of pseudo-devices that call
recursively into other pseudo-devices and real devices.
The protocol stack is compiled from a configuration
string that specifies the order of pseudo-devices and devices.
Each pseudo-device and device has a set of entry points
that corresponds to the operations that the file system
requires of a device.
The most notable operations are
.CW read ,
.CW write ,
and
.CW size .
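.PP
The filter-stack idea can be illustrated with a minimal sketch.
It is written in Python for brevity and is not the file server's code;
the class names
\f(CWRamDevice\fP
and
\f(CWSkipFirst\fP
are hypothetical, and only the three entry points named above are modelled.

```python
class RamDevice:
    """Hypothetical 'real' device: a fixed number of blocks held in memory."""

    def __init__(self, nblocks):
        self.blocks = [b""] * nblocks

    def size(self):
        return len(self.blocks)

    def read(self, addr):
        return self.blocks[addr]

    def write(self, addr, data):
        self.blocks[addr] = data


class SkipFirst:
    """Hypothetical pseudo-device: hides the first `skip` blocks of its
    sub-device and forwards every call to that sub-device."""

    def __init__(self, sub, skip):
        self.sub = sub
        self.skip = skip

    def size(self):
        return self.sub.size() - self.skip

    def read(self, addr):
        return self.sub.read(addr + self.skip)

    def write(self, addr, data):
        self.sub.write(addr + self.skip, data)


disk = RamDevice(10)
# Pseudo-devices nest like protocol layers: each call recurses downward.
stack = SkipFirst(SkipFirst(disk, 1), 2)
stack.write(0, b"hello")
```
.PP
Here block 0 of the stack lands on block 3 of the underlying device,
because each layer adds its own offset before calling the layer below.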
.PP
The device stack is best explained through
the syntax of the configuration string
that specifies it.
Configuration strings are used
during the setup of the file system.
For a description see
.I fsconfig (8).
In the following recursive definition,
.I D
represents a
string that specifies a block device.
.IP "\fID\fP = (\fIDD\fP...)"
.br
This is a set of devices that
are concatenated to form a single device.
The size of the concatenated device is the
sum of the sizes of the sub-devices.
.IP "\fID\fP = [\fIDD\fP...]"
.br
This is the interleaving of the
individual devices.
If there are N devices in the list,
then the pseudo-device is the N-way block
interleaving of the sub-devices.
The size of the interleaved device is
N times the size of the smallest sub-device.
.IP "\fID\fP = {\fIDD\fP...}"
.br
This is a set of devices that
constitute a `mirror' of the first sub-device, and form a single device.
A write to the device is performed,
at the same block address,
on the sub-devices, in right-to-left order.
A read from the device is performed on each sub-device,
in left-to-right order, until a read succeeds without error,
or the set is exhausted.
One can think of this as a poor man's RAID 1.
The size of the device is the size of the smallest sub-device.
.IP "\fID\fP = \f(CWp\fP\fIDN1.N2\fP"
.br
This is a partition of a sub-device.
The sub-device is partitioned into 100 equal pieces.
If the size of the sub-device is not divisible by 100,
then there will be some slop thrown away at the top.
The pseudo-device starts at the N1-th piece and
continues for N2 pieces. Thus
.CW p\fID\fP67.33
will be the
last third of the device
.I D .
.IP "\fID\fP = \f(CWf\fP\fID\fP"
.br
This is a fake write-once-read-many device simulated by a
second read-write device.
This second device is partitioned
into a set of block flags and a set of blocks.
The flags are used to generate errors if a
block is ever written twice or read without being written first.
.IP "\fID\fP = \f(CWx\fP\fID\fP"
.br
This is a byte-swapped version of the file system on
.I D .
Since the file server currently writes integers in metadata to disk
in native byte order, moving a file system to a machine of the other
major byte order (e.g., MIPS to Pentium)
requires the use of
.CW x .
It knows the sizes of the various integer fields in the file system metadata.
Ideally, the file server would follow the Plan 9 religion and write a consistent
byte order on disk, regardless of processor.
In the meantime, it should be possible to automatically determine the need
for byte-swapping by examining data in the super-block of each file system,
though this has not been implemented yet.
.IP "\fID\fP = \f(CWc\fP\fIDD\fP"
.br
This is the cache/WORM device made up of a cache (read-write)
device and a WORM (write-once-read-many) device.
More on this later.
.IP "\fID\fP = \f(CWo\fP"
.br
This is the dump file system that is the
two-level hierarchy of all dumps ever taken on a cache/WORM.
The read-only root of the cache/WORM file system
(on the dump taken Feb 18, 1995) can
be referenced as
.CW /1995/0218
in this pseudo-device.
The second dump taken that day will be
.CW /1995/02181 .
.IP "\fID\fP = \f(CWw\fP\fIN1.N2.N3\fP"
.br
This is a SCSI disk on controller N1, target N2 and logical unit number N3.
.IP "\fID\fP = \f(CWh\fP\fIN1.N2.0\fP"
.br
This is an (E)IDE or *ATA disk on controller N1, target N2
(target 0 is the IDE master, 1 the slave device).
These disks are currently run via programmed I/O, not DMA,
so they tend to be slower to access than SCSI disks.
.IP "\fID\fP = \f(CWr\fP\fIN1\fP"
.br
This is the same as
.CW w ,
but refers to a side of a WORM disc.
See the
.I j
device.
.IP "\fID\fP = \f(CWl\fP\fIN1\fP"
.br
This is the same as
.CW r ,
but one block from the SCSI disk is removed for labeling.
.IP "\fID\fP = \f(CWj(\fP\fID\d\s-2\&1\s+2\u\fID\d\s-2\&2\s+2\u\f(CW*)\fID\d\s-2\&3\s+2\u\f1"
.br
.I D\d\s-2\&1\s+2\u
is the jukebox SCSI interface.
The
.I D\d\s-2\&2\s+2\u 's
are the SCSI drives in the jukebox
and the
.I D\d\s-2\&3\s+2\u 's
are the demountable platters in the jukebox.
.I D\d\s-2\&1\s+2\u
and
.I D\d\s-2\&2\s+2\u
must be
.CW w .
.I D\d\s-2\&3\s+2\u
must be pseudo-devices of
.CW w ,
.CW r ,
or
.CW l
devices.
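.PP
The size rules given above for the concatenating, interleaving,
mirroring and partitioning pseudo-devices can be summarised in a short sketch.
It is written in Python purely for illustration; the function names
and the sample sizes are hypothetical, and the partition arithmetic
assumes a piece is the sub-device size divided by 100, rounded down,
as the text describes.

```python
def concat_size(sizes):
    """(DD...): sum of the sub-device sizes."""
    return sum(sizes)


def interleave_size(sizes):
    """[DD...]: N times the size of the smallest sub-device."""
    return len(sizes) * min(sizes)


def mirror_size(sizes):
    """{DD...}: size of the smallest sub-device."""
    return min(sizes)


def partition_range(size, n1, n2):
    """pDN1.N2: return (first block, block count) of the partition.

    The sub-device is cut into 100 equal pieces; any remainder
    at the top of the sub-device is left unused.
    """
    piece = size // 100
    return n1 * piece, n2 * piece


sizes = [1000, 1200, 950]            # made-up sub-device sizes, in blocks
print(concat_size(sizes))            # 3150
print(interleave_size(sizes))        # 2850
print(mirror_size(sizes))            # 950
print(partition_range(1050, 67, 33)) # (670, 330)
```
.PP
In the last call a piece is 10 blocks, so the partition covers blocks 670 through 999
and the top 50 blocks of the 1050-block sub-device are the slop.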
.PP
For
.CW w ,
.CW h ,
.CW l ,
and
.CW r
devices, any of the configuration numbers
can be replaced by an iterator of the form
.CW <\fIN1-N2\fP> .
N1 can be greater than N2, indicating a descending sequence.
Thus
.Ex
	[w0.<2-6>]
.Ee
is the interleaving of the SCSI disks on SCSI targets
2 through 6 of SCSI controller 0.
The main file system on
.I Emelie
is defined by the configuration string
.Ex
	c[w1.<0-5>.0]j(w6w5w4w3w2)(l<0-236>l<238-474>)
.Ee
This is a cache/WORM driver.
The cache is six interleaved disks on SCSI controller 1,
targets 0, 1, 2, 3, 4, and 5.
The WORM half of the cache/WORM
is 474 sides of jukebox discs.
Another file server,
.I choline ,
has a main file system defined by
.Ex
	c[w<1-3>]j(w1.<6-0>.0)(l<0-124>l<128-252>)
.Ee
The order of
.CW w1.<6-0>.0
matters here, since the optical jukebox's WORM drives'
SCSI target ids,
as delivered,
run in descending order relative to the numbers of the drives
in SCSI commands
(e.g., the jukebox controller is SCSI target 6,
drive #1 is SCSI target 5,
and drive #6 is SCSI target 0).
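.PP
The iterator notation used in the configuration strings above
can be expanded mechanically.
The following is a minimal sketch in Python, not the file server's parser;
the function name
\f(CWexpand\fP
is hypothetical, and the order in which several iterators in one
specification combine (leftmost varying slowest) is an assumption.

```python
import re

# Matches one iterator of the form <N1-N2>.
_ITER = re.compile(r"<(\d+)-(\d+)>")


def expand(spec):
    """Expand every <N1-N2> iterator in a device specification.

    N1 may be greater than N2, which yields a descending sequence.
    With several iterators, the leftmost varies slowest (an assumption).
    """
    m = _ITER.search(spec)
    if m is None:
        return [spec]
    first, last = int(m.group(1)), int(m.group(2))
    step = 1 if first <= last else -1
    result = []
    for n in range(first, last + step, step):
        # Substitute one value, then expand any remaining iterators.
        result.extend(expand(spec[:m.start()] + str(n) + spec[m.end():]))
    return result


print(expand("w0.<2-6>"))
# ['w0.2', 'w0.3', 'w0.4', 'w0.5', 'w0.6']
print(expand("w1.<6-0>.0")[:3])
# ['w1.6.0', 'w1.5.0', 'w1.4.0']
print(len(expand("l<0-236>")) + len(expand("l<238-474>")))
# 474
```
.PP
The descending form reproduces the order needed for the jukebox drives,
and the last line counts the platter sides named in the first example.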