Cache coherence between CPUs are most explained in textbooks, but IO coherence is not well understood. Recently I’m involved in architecture discussion about IO coherence, and found this paper, “Maintaining I/O Data Coherence in Embedded Multicore Systems” by Thomas B. Berg 2009, very useful coming to explain what is IO coherence and how to implement it in embedded system.
I/O Coherence Producer-consumer model Most mechanisms for passing data between IO device and CPU, either CPU -> IO or IO -> CPU, use the classic producer-cosumer model.
AMBA (ARM Advanced Microcontroller Bus Architecture)
1. AXI AXI protocol is a point-to-point protocol So no matter what the network channels really use, as long as its ports comply AXI protocol, IP can be connected to them Main features Separate read/write channels Improve bandwidth Multiple outstanding requests No strict timing relationship between address and data phases Unaligned data transfer Out-of-order transaction completion Implicitly incremental address for burst transfer Transfer vs transaction Transfer is a handshake: valid/ready Transaction is composed of multiple transfers Active transaction: already started but not finished.
ARM online training note
1. Introduction What is an architecture? Instruction set Exception model Memory model Debug ARMv8 AArch32 vs AArch64 AArch32: backward compatible to ARMv7 AArch64: fixed 32-bit instruction, new exception model, 64-bit virtual address Priviledge and security model 4-level of privilege EL0 < EL1 < EL2 < EL3, larger the higher privilege 2 security modes Mixture of AArch32 and AArch64 Only 64-bit OS can host a mix of 32-bit and 64-bit apps 32-bit app can only be on lower EL level 2.
Reference
Interrupt Categorization Hardware vs. Software Hardware: usually caused by peripheral or other processors IRQ: maskable interrupt NMI: non-maskable interrupt For highest priority tasks, like times, especially wathdog timers Wathdog timer: a timer has to be reset by software on purpose periodically, otherwise it means the software has gone into some hanging situation and will trigger watchdog routine to recover or reboot. IPI: inter-processor interrupt Software: caused by exception or special instructions that used to implement system calls Interrupt vs inter-process communication signal Interrupt: mediated by the processor (hardware); handled by the kernel Signal: mediated by the kernel (through systeam call); handled by processes Such as: SIGSEGV, SIGBUS, SIGILL, SIGFPE Precise vs imprecise interrupt Precise interrupts has PC and other architecture states are saved, so after interrupt handler is done the current process can resume All instructions before the time point have fully executed, and no instructions beyond has been executed (or they are killed) Triggering methods Physical interrupt Level-triggered vs.
https://developer.arm.com/technologies/big-little
big.LITTLE is a practical example of SMP (Symmetric Multiprocessing). It combines high performance CPU cores and low power CPU cores in the same chip, connected using cache coherent interconnect, to achieve high peak performance within thermal bounds of the system when intense computational power is needed, as well as maximum energy efficiency when the device is in light usage mode most of the time. It’s a particular adaption to mobile devices usage.
There is a big difference between how I used to understand hardware security and state-of-the-art security supported by hardware software co-design, after I watched some video talking about SEP (Security Enclave Processor) by Apple. It’s a key component in current iPhone to protect user data and password from being observed in any kind of hacking, including traditional side channel attack such as DPA (Dynamic Power Attack), debug channel attack, normal network attack, and etc.
From the reading of this paper, “The Hwacha Microarchitecture Manual, Version 3.8.1”, I found out that our Pygmy ES1 architecture is almost the same idea, just not as fancy.
We don’t have cache coherency, because we operate on unified physical memory space The vector execution unit is not as fancy, just useless multithreading, no systolic The prefetch is supposed to be done by DMA engine that is manually controlled by software System architecture The vector accelerator only has L1 I$, no D$ Don’t need to maintain cache coherency Lots of vector registers, 512 in total, each is 64x2x4=512-bit Wide bus connection to L2$, to provide higher bandwidth Uncached TileLink between L1 I$ (both scalar processor and vector processors) and L2$ Cached TileLink between L1 D$ (only in scalar processor) L2$ maintains directory bits, which determines the states of corresponding cache line (JW: maybe something like MESI bits) Operations of L2$, supported by TileLink protocol (“Productive Design of Extensible On-Chip Memory Hierarchies”) Sub-cache-block accesses Data prefetch requests Read from DDR, don’t need to send back to the requester Atomic memory operations ALU inside L2 cache banks Decoupling Access/execute decoupling Decoupled vector arch Cache refill/access decoupling Vector Command Queue (VCMDQ) Instruction fetch is handled by scalar processor, and then sent to VCMDQ There is explicity defined start vf and stop vstop instructions that flags the begin and end of vector instructions JW: why is that necessary?
Ariane Document
Architecture note PC gen stage The fetching address for i-cache is always word-aligned. Fetch stage Its fetch stage doesn’t have much decoding work to do, only the necessary one to generate next PC. And it relies on its branch prediction to give out next PC.
There is an internal FIFO with 2 entries to log the PC (and other meta-info) that was sent to i-cache, while waiting for its response.
NoC Clustering coefficient: the most intuitive explanation is the number of hops between two random nodes in the network. Layers Physical layer Link layer Transaction protocol: such as AXI Seperated channels like AXI, to avoid dead-lock caused by depency problem Transport layer Packet: header & payload Flow control On link-level or end-to-end level Link-level: every hop there is a notification, just like valid-ready protocol End-to-end level: every transaction of data from sender to receiver has to have some kind of notification that the receiver notify the sender it has received the data successfully.
Coherence mechanism Snooping Every cache maintain its own cache state. And when it needs to access a shared address space, it sends snooping messages to all the other caches to either update or invalidate them.
Write invalidate: write operation will invalidate all the other shared copies. Others will have to read again from the next level of cache to use it again. Write update: write operation will give the written data to the shared copies and update them accordingly.