Some of My Ph.D. Projects in Loongson, ICT, CAS

High-speed full-custom register file design

From Mar. 2004 to Jan. 2006

The purpose of this project was to design a high-speed register file with multiple synchronous read and write ports, to meet the very high frequency and bandwidth requirements of a 1GHz, 4-issue, 64-bit general-purpose RISC CPU. We needed to implement an SRAM with 8 independent read ports and 4 independent write ports and a read latency of less than 500ps. Our solution was to use 2 SRAMs with identical data content, each with 4 independent read ports and 4 independent write ports. At RTL level, we write both of them with the same address and data. It is also guaranteed that the read address and write address will not be the same in one cycle.
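The mirrored-bank idea can be sketched as a small behavioral model. This is only an illustration in Python, not the RTL or the circuit: the class names, the depth of 32 entries and the split of reads 0-3 to one bank and 4-7 to the other are assumptions made for the example.

```python
class Bank4R4W:
    """Behavioral model of one SRAM bank with 4 read and 4 write ports."""

    def __init__(self, depth=32, width=64):  # depth of 32 is an assumption
        self.mask = (1 << width) - 1
        self.cells = [0] * depth

    def write(self, writes):
        # writes: up to 4 (address, data) pairs applied in the same cycle
        assert len(writes) <= 4, "only 4 write ports per bank"
        for addr, data in writes:
            self.cells[addr] = data & self.mask

    def read(self, addrs):
        # addrs: up to 4 read addresses served in the same cycle
        assert len(addrs) <= 4, "only 4 read ports per bank"
        return [self.cells[a] for a in addrs]


class RegFile8R4W:
    """8-read/4-write register file built from two mirrored 4R/4W banks."""

    def __init__(self, depth=32, width=64):
        self.banks = [Bank4R4W(depth, width), Bank4R4W(depth, width)]

    def cycle(self, reads, writes):
        assert len(reads) <= 8 and len(writes) <= 4
        # A read address never equals a write address within one cycle.
        assert not set(reads) & {addr for addr, _ in writes}
        # Reads 0-3 are served by bank 0, reads 4-7 by bank 1.
        out = self.banks[0].read(reads[:4]) + self.banks[1].read(reads[4:8])
        # Every write is mirrored into both banks so their contents stay identical.
        for bank in self.banks:
            bank.write(writes)
        return out
```

Because each bank only ever sees 4 reads, the per-bank cell stays a 4R/4W design, while the pair together serves 8 reads per cycle.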

The hardest part of the design was read latency. A normal SRAM with differential sense amplifiers needs 2 read bit lines for each read port, which would take too much area with 4 read ports and 4 write ports. Larger area means longer word lines and bit lines, which leads to larger read latency. In addition, a differential sense amplifier is not fast enough to meet a 500ps read latency. We had 2 choices: a pre-charged style sense amplifier, or an analog sense amplifier based on a current mirror. Because both of them are related to patents, details won’t be disclosed here. The former is simpler and more suitable for small SRAMs; the latter is faster but has much higher power consumption.

My role in this project changed from chip to chip, since we designed several versions for several chips using different CMOS processes. Hopping among different fabs and technology nodes was a management decision; they were trying very hard to achieve the 1GHz frequency target. I have to say these changes added a lot of unnecessary work for the custom design team, although they actually helped me a lot: I could focus on one single project, learn from senior engineers and eventually take on much more responsibility in the project.

In the first chip, we used the 130nm IBM CMOS process. My role in this version was writing Tcl scripts to extract data from Nanosim simulation results and create Liberty files that describe the SRAM’s timing and power characteristics.

In the second chip, we used the 130nm Chartered CMOS process. In this version, I took on more responsibility, including: (1) improving and maintaining the SPICE-level simulation environment used to verify function and extract timing and power numbers; (2) running pre-layout and post-layout simulations to verify function and timing; (3) drawing the layout of the read and write address decoders. The most important jobs were function and timing verification, but the most time-consuming job was drawing the layout entirely by hand. (Actually it would have been better to implement these digital decoders with synthesis, or at least with some custom design routing tool, although this job did improve my understanding of layout.)

In the third chip, we used the 90nm TSMC CMOS process. Because we had changed from 130nm to 90nm, we had to shrink the design and redo all the verification work. One of the senior engineers left before this version started, so I took over his position and most of the responsibility, including: (1) shrinking the design and verifying it in pre-layout simulation; (2) planning and balancing the clock; (3) improving read and write latency by sizing transistors; (4) floorplanning; (5) drawing the layout of the sense amplifier; (6) verifying function and timing with post-layout simulation; (7) extracting timing and power numbers and creating the Liberty library database; (8) helping to meet the project schedule.

CPU-to-CPU optical interconnect evaluation

Oct. 2016

This project started from a thought of my professor’s that on-chip optical interconnection would change chip design dramatically in the near future. He got this idea from a presentation by another assistant professor. So I was assigned to do some research in this area.

After a lot of communication with the assistant professor who had the original thought, and after reading lots of papers, I concluded that on-chip optical interconnection was far beyond the technology of that time, and that it would take at least 5 years before the industry adopted it. (It’s now 7 years later, and on-chip optical interconnection is still an idea in the lab.) However, ultra-long-distance optical interconnection was very mature, and even board-to-board optical interconnection was widely used.

Optical interconnection faced the difficulty that electrical signals have to be converted to optical signals and then back again. This increases the cost a lot, so optical is not cost-efficient for short-distance interconnection. Another difficulty for on-chip interconnection was that CMOS technology is not a good candidate for implementing the laser (electrical to optical) and the photodiode (optical to electrical).

In this project, I didn’t have many resources. Most of the time I worked alone, and I didn’t have much budget. So my strategy was to build a chip-to-chip optical interconnection prototype system aimed at low-latency, high-throughput CPU-to-CPU communication.

  1. Physical layer. Fortunately, some of my colleagues in the same institute had already done a similar project with a Xilinx FPGA and a pair of optical transmitters. These optical transmitters were driven by the on-FPGA high-speed serial port, RocketIO. Although I had done some work on implementing a prototype myself, I finally decided to use this board and focus on RTL design and follow-up research.
  2. Protocol. Because we were designing this for CPU-to-CPU communication, the protocol had to be low latency. Therefore we could not use a normal multi-layer protocol, and had to design our own with only one or two layers, which also had to be very simple in order to run at high frequency. This protocol includes: (1) inter- and intra-byte alignment, because there was serialization; (2) 8B/10B encoding, because there was CDR (clock data recovery) at the receiver side; (3) an error control scheme, because the error rate on optical communication cannot be ignored.
  3. BIST. For debug and test purposes, I designed a BIST module that emulates the CPUs’ behavior as both transmitter and receiver. The transmitter side includes a 64-bit PRBS (pseudo-random bit sequence) data generator as well as an error insertion module, which is also based on PRBS and can pseudo-randomly insert single-bit errors into the data to test the error control scheme. The receiver side includes a data checker that can also output the error rate numbers to on-board LEDs.
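The BIST data path described in item 3 can be sketched in software. The following Python model is only an illustration under assumptions: the real module was RTL on the FPGA, the PRBS polynomial is not stated above (the common PRBS-7 polynomial x^7 + x^6 + 1 is used here), and the function names, seeds and injection rule are hypothetical.

```python
def prbs7_bits(seed=0x7F):
    """Yield bits of a PRBS-7 sequence (x^7 + x^6 + 1, period 127)."""
    state = seed & 0x7F
    assert state != 0, "an all-zero LFSR state never changes"
    while True:
        new_bit = ((state >> 6) ^ (state >> 5)) & 1
        state = ((state << 1) | new_bit) & 0x7F
        yield new_bit


def prbs_words(seed=0x7F, width=64):
    """Transmitter side: pack PRBS bits into width-bit data words, MSB first."""
    bits = prbs7_bits(seed)
    while True:
        word = 0
        for _ in range(width):
            word = (word << 1) | next(bits)
        yield word


def error_injector(words, seed=0x2A, width=64):
    """Flip one pseudo-randomly chosen bit in some words (assumed rule)."""
    bits = prbs7_bits(seed)

    def draw(n):
        value = 0
        for _ in range(n):
            value = (value << 1) | next(bits)
        return value

    for word in words:
        if draw(3) == 0:                        # inject when a 3-bit draw is zero
            word ^= 1 << (draw(6) % width)      # position chosen by a 6-bit draw
        yield word


def check(received, n_words, seed=0x7F, width=64):
    """Receiver side: regenerate the expected words and count bit errors."""
    expected = prbs_words(seed, width)
    bit_errors = 0
    for _, rx in zip(range(n_words), received):
        bit_errors += bin(rx ^ next(expected)).count("1")
    return bit_errors, bit_errors / (n_words * width)
```

For example, `check(error_injector(prbs_words()), 1000)` returns the number of injected bit errors and the resulting bit error rate, which is the kind of figure the checker drove onto the LEDs.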

This paper, “An Efficient Error Control Scheme for Chip-to-Chip Optical Interconnects”, focuses on the error control scheme in the protocol. Normally in long-distance interconnection, a CRC check with a re-send strategy is used. But in optical interconnect, there are 2 major differences in error behavior: (1) the error rate is lower than in electrical interconnect; (2) errors normally hit random single bits, rather than runs of consecutive bits as in electrical interconnect. Therefore, after some experiments and calculation, I proposed to use ECC instead of CRC as the error control scheme in optical interconnects, and it is much more area, power and latency efficient.
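To illustrate why ECC suits random single-bit errors, here is a minimal sketch of a single-error-correcting Hamming code over a 64-bit word. The specific code is an assumption for illustration only; the ECC actually used in the paper is not described here, and the names `encode` and `decode` are hypothetical.

```python
DATA_BITS = 64
PARITY_POS = [1, 2, 4, 8, 16, 32, 64]        # parity bits sit at power-of-two positions
CODE_LEN = DATA_BITS + len(PARITY_POS)       # 71-bit codeword
DATA_POS = [p for p in range(1, CODE_LEN + 1) if p not in PARITY_POS]


def _syndrome(code):
    """XOR of the 1-based positions of all set bits in the codeword."""
    s = 0
    for pos in range(1, CODE_LEN + 1):
        if (code >> (pos - 1)) & 1:
            s ^= pos
    return s


def encode(data):
    """Spread 64 data bits over the non-parity positions, then add parity."""
    code = 0
    for i, pos in enumerate(DATA_POS):
        if (data >> i) & 1:
            code |= 1 << (pos - 1)
    # Set each parity bit so that the syndrome of the full codeword is zero.
    s = _syndrome(code)
    for p in PARITY_POS:
        if s & p:
            code |= 1 << (p - 1)
    return code


def decode(code):
    """Return (data, corrected); a single flipped bit is fixed in place."""
    s = _syndrome(code)
    corrected = s != 0
    if corrected:
        code ^= 1 << (s - 1)     # the syndrome is the position of the flipped bit
    data = 0
    for i, pos in enumerate(DATA_POS):
        if (code >> (pos - 1)) & 1:
            data |= 1 << i
    return data, corrected
```

With such a code the receiver repairs a single-bit error locally, so no re-send round trip is needed, whereas CRC only detects the error. This sketch does not detect double-bit errors; that would need an extra overall parity bit.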