Vector regfile
- 32 of them, v0 to v31
- Each is VLEN bits
- Each can be divided into several elements
- The max element width is ELEN
- CSR vsew maps to SEW (selected element width) and controls the element width dynamically
- CSR vl controls the number of elements to operate on for vector instructions (see the sketch below)
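A minimal sketch (plain C, not spec pseudocode) of how VLEN, SEW, and vl relate; VLEN = 128 and the helper names are illustrative assumptions.

```c
#include <stdio.h>

#define VLEN 128   /* bits per vector register (assumed for illustration) */

/* Elements that fit in one vector register for a given SEW. */
static unsigned elements_per_reg(unsigned sew_bits) {
    return VLEN / sew_bits;
}

/* A requested vl is clamped to what one register can hold
 * (register grouping ignored here). */
static unsigned set_vl(unsigned requested, unsigned sew_bits) {
    unsigned vlmax = elements_per_reg(sew_bits);
    return requested < vlmax ? requested : vlmax;
}

int main(void) {
    printf("SEW=32: %u elements per register\n", elements_per_reg(32));
    printf("request vl=10 at SEW=32 -> vl=%u\n", set_vl(10, 32));
    printf("request vl=10 at SEW=64 -> vl=%u\n", set_vl(10, 64));
    return 0;
}
```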
- Packing of shorter vectors (sketch below)
  - When SEW is smaller than ELEN, multiple SEW-wide elements are packed into one ELEN unit, following the little-endian rule
  - ELEN units are likewise packed into the VLEN register, also following the little-endian rule
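A sketch of the little-endian packing rule, assuming VLEN = 128 bits and modeling a vector register as a byte array: element i of a given SEW occupies the bytes starting at offset i*SEW/8, least-significant byte first. The function name is illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define VLEN_BYTES 16   /* VLEN = 128 bits, assumed for illustration */

/* Read element `idx` of width `sew_bytes` from a vector register modeled
 * as a little-endian byte array: element i lives at byte offset i*sew_bytes. */
static uint64_t get_element(const uint8_t reg[VLEN_BYTES],
                            unsigned sew_bytes, unsigned idx) {
    uint64_t val = 0;
    /* little-endian: byte 0 is the least-significant byte of the element */
    for (unsigned b = 0; b < sew_bytes; b++)
        val |= (uint64_t)reg[idx * sew_bytes + b] << (8 * b);
    return val;
}

int main(void) {
    uint8_t reg[VLEN_BYTES] = {0x11, 0x22, 0x33, 0x44, 0x55, 0x66, 0x77, 0x88};
    /* With SEW = 16 bits: element 0 = 0x2211, element 1 = 0x4433, ... */
    printf("elem 0 = 0x%llx\n", (unsigned long long)get_element(reg, 2, 0));
    printf("elem 1 = 0x%llx\n", (unsigned long long)get_element(reg, 2, 1));
    return 0;
}
```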
- Storage of longer vectors
  - If an operand wider than SEW is needed, then:
    - The even-numbered vector register of the pair holds the even-numbered elements
    - The odd-numbered vector register holds the odd-numbered elements
  - Why? The author said it's designed to simplify datapath alignment (see the sketch below)
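A reference model (not from the spec text, just a reading aid for the scheme described above): even-numbered elements of a wider-than-SEW result land in the even register of the pair, odd-numbered elements in the odd register. Array sizes and names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define NELEM 8

/* Split a wide (2xSEW) result across an even/odd register pair, with
 * register contents modeled as plain arrays. */
static void split_wide_result(const uint64_t wide[NELEM],
                              uint64_t even_reg[NELEM / 2],
                              uint64_t odd_reg[NELEM / 2]) {
    for (int i = 0; i < NELEM; i++) {
        if (i % 2 == 0)
            even_reg[i / 2] = wide[i];   /* elements 0, 2, 4, ... */
        else
            odd_reg[i / 2] = wide[i];    /* elements 1, 3, 5, ... */
    }
}

int main(void) {
    uint64_t wide[NELEM] = {10, 11, 12, 13, 14, 15, 16, 17};
    uint64_t even_reg[NELEM / 2], odd_reg[NELEM / 2];
    split_wide_result(wide, even_reg, odd_reg);
    printf("even reg: %llu %llu %llu %llu\n",
           (unsigned long long)even_reg[0], (unsigned long long)even_reg[1],
           (unsigned long long)even_reg[2], (unsigned long long)even_reg[3]);
    return 0;
}
```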
- Register grouping via the vlmul field (sketch below)
  - Any operation applied to one vector register n also applies to all the other vector registers in the same group
  - Depending on vlmul: 00 = no grouping, 01 = grouped with v[n+16], 10 = grouped with v[n+8], 11 = grouped with v[n+4]
  - VLMAX changes accordingly: 2x, 4x, 8x
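A reading aid for the grouping pattern as these notes describe it (n+16, n+8, n+4); the encodings, strides, and VLEN value are assumptions taken from the bullets above, not from the spec.

```c
#include <stdio.h>

#define VLEN 128   /* assumed for illustration */

/* Group size doubles with each vlmul step: 1, 2, 4, 8 registers. */
static unsigned group_size(unsigned vlmul) { return 1u << vlmul; }

/* VLMAX scales with the number of grouped registers. */
static unsigned vlmax(unsigned vlmul, unsigned sew_bits) {
    return (VLEN / sew_bits) * group_size(vlmul);
}

static void print_group(unsigned n, unsigned vlmul) {
    unsigned size = group_size(vlmul);
    unsigned stride = (vlmul == 0) ? 0 : 32 / size;  /* 16, 8, 4 per the notes */
    printf("group of v%u:", n);
    for (unsigned i = 0; i < size; i++)
        printf(" v%u", n + i * stride);
    printf("  (VLMAX at SEW=32: %u)\n", vlmax(vlmul, 32));
}

int main(void) {
    for (unsigned vlmul = 0; vlmul < 4; vlmul++)
        print_group(2, vlmul);
    return 0;
}
```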
- Reuse as the floating-point regfile
  - Floating-point registers reside in the least-significant FLEN bits of the vector registers
  - Lower-precision floating-point types are NaN-boxed (the bits above the value are set to all 1s; sketch below)
  - Loading floating-point data doesn't change the upper bits when VLEN > FLEN
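A sketch of NaN-boxing a 32-bit float into a 64-bit slot (FLEN = 64 assumed): the value sits in the low bits and everything above it is filled with 1s, so the wider view reads as a NaN.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* NaN-box a 32-bit float into a 64-bit container: the value occupies the
 * least-significant 32 bits and all higher bits are set to 1. */
static uint64_t nan_box_f32(float f) {
    uint32_t bits;
    memcpy(&bits, &f, sizeof bits);
    return 0xFFFFFFFF00000000ULL | bits;
}

int main(void) {
    uint64_t boxed = nan_box_f32(1.5f);
    printf("boxed = 0x%016llx\n", (unsigned long long)boxed);
    return 0;
}
```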
Vector operation
- Masking
  - Masked-off elements are skipped and will not generate exceptions
  - Uses the LSB of each SEW-wide element in v0 as the mask bit (see the sketch below)
  - 4 types in vm[1:0]: true, false, scalar, no masking
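A scalar reference loop for the masking rule above, with vectors modeled as plain arrays: only elements whose v0 LSB is 1 are operated on; the rest are skipped and cannot fault. Function and variable names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

/* Masked element-wise add: dst[i] = a[i] + b[i] only where the LSB of the
 * corresponding mask element is 1; masked-off elements are left untouched. */
static void masked_add(uint32_t *dst, const uint32_t *a, const uint32_t *b,
                       const uint32_t *mask_v0, unsigned vl) {
    for (unsigned i = 0; i < vl; i++) {
        if (mask_v0[i] & 1)          /* LSB of each mask element */
            dst[i] = a[i] + b[i];
        /* else: skipped, no operation, no exception */
    }
}

int main(void) {
    uint32_t a[4] = {1, 2, 3, 4}, b[4] = {10, 20, 30, 40};
    uint32_t dst[4] = {0, 0, 0, 0};
    uint32_t v0[4] = {1, 0, 1, 0};   /* mask bits in the LSBs */
    masked_add(dst, a, b, v0, 4);
    printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);  /* 11 0 33 0 */
    return 0;
}
```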
- Vector load/store (addressing modes sketched below)
  - Unit-stride
    - Starting address = base address in RS1 + immediate offset
    - A fault-first version exists
  - Constant-stride
    - Base in RS1, stride in RS2, in units of bytes
  - Indexed (scatter-gather)
    - Base in RS1, indices in VRS2 (signed integers)
    - If SEW is larger than XLEN, only the LSB bits of each index are taken
    - For stores, there is an extra unordered-indexed version, in contrast to the ordered-indexed one
      - JW: the unordered version can be made higher performance
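A scalar sketch of the three addressing modes (unit-stride, constant byte stride, indexed gather) against a byte-addressed memory image; the helper functions are illustrative, not spec pseudocode.

```c
#include <stdint.h>
#include <string.h>
#include <stdio.h>

/* Load one 32-bit element; memcpy avoids alignment issues in this model. */
static uint32_t load32(const uint8_t *mem, size_t addr) {
    uint32_t v;
    memcpy(&v, mem + addr, sizeof v);
    return v;
}

/* Unit-stride: consecutive elements starting at the base address. */
static void unit_stride_load(uint32_t *dst, const uint8_t *mem,
                             size_t base, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        dst[i] = load32(mem, base + i * sizeof(uint32_t));
}

/* Constant-stride: element i at base + i * stride (stride in bytes). */
static void strided_load(uint32_t *dst, const uint8_t *mem, size_t base,
                         long stride_bytes, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        dst[i] = load32(mem, (size_t)((long)base + (long)i * stride_bytes));
}

/* Indexed (gather): element i at base + index[i], indices signed. */
static void indexed_load(uint32_t *dst, const uint8_t *mem, size_t base,
                         const int64_t *index, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        dst[i] = load32(mem, (size_t)((int64_t)base + index[i]));
}

int main(void) {
    uint8_t mem[64];
    for (unsigned i = 0; i < 64; i++) mem[i] = (uint8_t)i;
    uint32_t d[2];
    int64_t idx[2] = {8, 0};
    unit_stride_load(d, mem, 0, 2);  printf("0x%x 0x%x\n", d[0], d[1]);
    strided_load(d, mem, 0, 8, 2);   printf("0x%x 0x%x\n", d[0], d[1]);
    indexed_load(d, mem, 0, idx, 2); printf("0x%x 0x%x\n", d[0], d[1]);
    return 0;
}
```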
- Vector arithmetic
  - Widening operations with the w suffix put the 2xSEW-wide destination into an even-odd vector register pair
  - Merge: use the mask field to merge 2 source operands
  - Narrowing: convert a wider (2xSEW) vector into a single-width (SEW) vector
  - Reduction: scalar <= vector op scalar
  - Matrix multiplication support (see the dot-product sketch below)
    - Fused multiply-add + reduction
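A scalar sketch of the fused multiply-add + reduction pattern behind the matrix-multiplication support: SEW-wide inputs, 2xSEW-wide products and accumulator, reduced to one scalar (a dot product). It models the arithmetic only, not particular instructions.

```c
#include <stdint.h>
#include <stdio.h>

/* Dot product as widening multiply-accumulate + reduction:
 * 16-bit inputs (SEW), 32-bit products and accumulator (2xSEW). */
static int32_t dot_product(const int16_t *a, const int16_t *b, unsigned vl) {
    int32_t acc = 0;
    for (unsigned i = 0; i < vl; i++)
        acc += (int32_t)a[i] * (int32_t)b[i];   /* widen, multiply, accumulate */
    return acc;                                 /* reduction to a scalar */
}

int main(void) {
    int16_t a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("dot = %d\n", dot_product(a, b, 4));  /* 70 */
    return 0;
}
```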
- Mask
  - Mask population count
  - Find-first-set mask bit
  - viota: counts, for each element position, the number of masked (or un-masked) elements in v0 before it
    - Can be combined with scatter/gather instructions to perform vector compress/expand (see the sketch below)
    - They are still discussing whether this should be able to take any register, not only v0
  - Set-before-first mask bit
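A sketch of how a viota-style prefix count of mask bits gives the scatter indices for a compress: active element i lands at position iota[i]. MAXVL and the function names are illustrative.

```c
#include <stdint.h>
#include <stdio.h>

#define MAXVL 64

/* viota-style operation: iota[i] = number of set mask bits before element i. */
static void viota(unsigned *iota, const uint8_t *mask, unsigned vl) {
    unsigned count = 0;
    for (unsigned i = 0; i < vl; i++) {
        iota[i] = count;
        count += mask[i] & 1;
    }
}

/* Compress: using the prefix counts as scatter indices packs the active
 * elements contiguously at the front of the destination. */
static unsigned compress(uint32_t *dst, const uint32_t *src,
                         const uint8_t *mask, unsigned vl) {
    unsigned iota[MAXVL];
    unsigned n = 0;
    viota(iota, mask, vl);
    for (unsigned i = 0; i < vl; i++)
        if (mask[i] & 1) { dst[iota[i]] = src[i]; n++; }
    return n;
}

int main(void) {
    uint32_t src[6] = {10, 11, 12, 13, 14, 15}, dst[6] = {0};
    uint8_t mask[6] = {1, 0, 1, 1, 0, 1};
    unsigned n = compress(dst, src, mask, 6);
    for (unsigned i = 0; i < n; i++)
        printf("%u ", dst[i]);   /* prints: 10 12 13 15 */
    printf("\n");
    return 0;
}
```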
- Permutation (slide/gather sketched below)
  - insert/extract
  - slide
  - gather
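Scalar reference versions of two of the permutation primitives, slide and register gather; the out-of-range behavior here (fill with zero) is just a modeling choice.

```c
#include <stdint.h>
#include <stdio.h>

/* Slide-down by `offset`: dst[i] = src[i + offset]; elements past the end
 * are filled with zero in this model. */
static void slide_down(uint32_t *dst, const uint32_t *src,
                       unsigned offset, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        dst[i] = (i + offset < vl) ? src[i + offset] : 0;
}

/* Register gather: dst[i] = src[index[i]]; out-of-range indices read as zero. */
static void gather(uint32_t *dst, const uint32_t *src,
                   const uint32_t *index, unsigned vl) {
    for (unsigned i = 0; i < vl; i++)
        dst[i] = (index[i] < vl) ? src[index[i]] : 0;
}

int main(void) {
    uint32_t src[4] = {100, 101, 102, 103}, dst[4];
    uint32_t idx[4] = {3, 3, 0, 1};
    slide_down(dst, src, 1, 4);
    printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);  /* 101 102 103 0 */
    gather(dst, src, idx, 4);
    printf("%u %u %u %u\n", dst[0], dst[1], dst[2], dst[3]);  /* 103 103 100 101 */
    return 0;
}
```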
Exception
- CSR progress logs the element index that caused the trap
  - With this, a failed operation can resume from exactly where it trapped (see the resume sketch at the end of this section)
- Precise vs imprecise traps
  - Precise trap: slow, but supports debugging
  - Imprecise trap: fast, but may obscure error conditions
  - Implementations can choose
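A sketch of resuming from the recorded element index after a trap: the loop starts at the saved progress value instead of 0. The fault model and names are invented for illustration; the progress variable stands in for the CSR described above.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdbool.h>

/* Reference model: a vector load that may fault mid-way records the element
 * index it reached, so a retry can resume from that index instead of from 0. */
static bool try_load(uint32_t *dst, const uint32_t *mem, unsigned vl,
                     unsigned *progress, unsigned fault_at) {
    for (unsigned i = *progress; i < vl; i++) {
        if (i == fault_at) {      /* pretend element fault_at page-faults */
            *progress = i;        /* the trap handler would see this index */
            return false;
        }
        dst[i] = mem[i];
    }
    *progress = 0;                /* completed; reset for the next instruction */
    return true;
}

int main(void) {
    uint32_t mem[8] = {0, 1, 2, 3, 4, 5, 6, 7}, dst[8] = {0};
    unsigned progress = 0;
    if (!try_load(dst, mem, 8, &progress, 5))
        printf("trapped at element %u\n", progress);
    /* ...the handler fixes the fault (e.g. maps the page), then we resume: */
    try_load(dst, mem, 8, &progress, 99 /* no fault this time */);
    printf("dst[7] = %u\n", dst[7]);
    return 0;
}
```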
Chip designer